In [None]:
"""
The MIT License (MIT)
Copyright (c) 2021 NVIDIA
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""


This code example generates feature vectors corresponding to images in the input dataset. These feature vectors are used by the image captioning network in the code example v9_6_image_captioning. More context for this code example can be found in video 9.6 "Programming Example: Image Captioning with PyTorch" in the video series "Learning Deep Learning: From Perceptron to Large Language Models" by Magnus Ekman (Video ISBN-13: 9780138177614). This is notebook 1 of 2 for this example.


This programming example assumes that the following resources from the COCO dataset are available:
The file captions_train2014.json should be located in the directory ../data/coco/
All the training images should be located in the directory ../data/coco/train2014/

The resulting feature vectors will be stored in the directory pt_data/feature_vectors/

The import statements are shown in the first code snippet below.


In [None]:
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms
from PIL import Image
import json
import numpy as np
import pickle
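import gzip
import os

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
TRAINING_FILE_DIR = '../data/coco/'
OUTPUT_FILE_DIR = 'pt_data/feature_vectors/'

# Create the output directory if it does not already exist.
os.makedirs(OUTPUT_FILE_DIR, exist_ok=True)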


The parts of the dataset that we will use are contained in two resources. The first resource is a JSON file that contains captions as well as filenames and some other information for the images. We assume that you have placed that file in the directory pointed to by the variable TRAINING_FILE_DIR. The images themselves are stored as individual image files and are assumed to be located in a directory named train2014 inside the directory pointed to by TRAINING_FILE_DIR. The COCO dataset comes with elaborate tools to parse and read the rich information about the various images, but because we are only interested in the image captions, we access the JSON file directly and extract the limited data that we need ourselves. The code snippet below opens the JSON file and creates a dictionary that, for each image, maps a unique key (the image ID) to a list of strings. The first string in each list is the image filename, and the subsequent strings are alternative captions for the image.


In [None]:
with open(TRAINING_FILE_DIR \
          + 'captions_train2014.json') as json_file:
    data = json.load(json_file)
image_dict = {}
for image in data['images']:
    image_dict[image['id']] = [image['file_name']]
for anno in data['annotations']:
    image_dict[anno['image_id']].append(anno['caption'])

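To make the structure of the resulting dictionary concrete, the following sketch applies the same two loops to a tiny hand-written stand-in for the JSON data. The image IDs, file names, and captions are invented for illustration and are not taken from COCO; only the field names match those used above.

```python
# Toy stand-in for the parsed JSON file (all values are made up).
data = {
    'images': [
        {'id': 1, 'file_name': 'img_a.jpg'},
        {'id': 2, 'file_name': 'img_b.jpg'},
    ],
    'annotations': [
        {'image_id': 1, 'caption': 'A dog on a beach.'},
        {'image_id': 2, 'caption': 'Two cups on a table.'},
        {'image_id': 1, 'caption': 'A dog running in sand.'},
    ],
}

# Same logic as above: filename first, then all captions.
image_dict = {}
for image in data['images']:
    image_dict[image['id']] = [image['file_name']]
for anno in data['annotations']:
    image_dict[anno['image_id']].append(anno['caption'])

for key, item in image_dict.items():
    print(key, item)
```

Each value is a list whose first element is the filename and whose remaining elements are the captions, in the order they appear in the annotations.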

The next step is to create our pretrained VGG19 model, which is done in the next code snippet. We first obtain the full pretrained VGG19 model. We then create a new model but drop the fully connected layers at the top of the model. Looking at the code, it is non-obvious how we drop multiple layers. It turns out that the layers are grouped into three blocks. The first block contains the convolutional layers and pooling layers. The second block is an adaptive average-pooling layer, and the third block contains the fully connected layers. That is, by selecting only block 0, we drop a number of layers. We then drop the last layer in block 0, which is a max-pooling layer. That is, the output from our new model is the output of the top-most convolutional layer (after its ReLU activation) in the original model.

We then put the new model in evaluation mode and transfer it to the GPU, or leave it on the CPU if no GPU is available.


In [None]:
# Create network without top layers.
model = torchvision.models.vgg19(weights='DEFAULT')
model_blocks = list(model.children())
layers = list(model_blocks[0].children())
model = nn.Sequential(*layers[0:-1])
model.eval()

# Transfer model to GPU
model.to(device)

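As a sanity check on which layers remain, the sketch below walks through the standard VGG19 layer configuration (sixteen 3×3 convolutions with padding 1, interleaved with five 2×2 max-pooling layers), drops the final max-pooling layer, and computes the output shape for a 224×224 input. The list VGG19_CFG and the helper output_shape are written for this illustration only; the code does not instantiate the actual model.

```python
# Standard VGG19 configuration: numbers are the output channels of a
# 3x3 convolution (padding 1, followed by ReLU); 'M' is a 2x2
# max-pooling layer with stride 2.
VGG19_CFG = [64, 64, 'M', 128, 128, 'M',
             256, 256, 256, 256, 'M',
             512, 512, 512, 512, 'M',
             512, 512, 512, 512, 'M']

def output_shape(cfg, height, width, channels=3):
    """Return (channels, height, width) after applying cfg."""
    for layer in cfg:
        if layer == 'M':
            # Max pooling halves each spatial dimension.
            height //= 2
            width //= 2
        else:
            # A padded 3x3 convolution keeps the spatial size.
            channels = layer
    return (channels, height, width)

# Full convolutional block, including the final max pooling.
print(output_shape(VGG19_CFG, 224, 224))
# Same block with the final max-pooling layer dropped.
print(output_shape(VGG19_CFG[:-1], 224, 224))
```

Dropping the final pooling layer doubles the spatial resolution of the feature map from 7×7 to 14×14 while keeping the 512 channels.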

We are now ready to run all the images through the network, extract the feature vectors, and save them to disk. This is done by the code snippet below. We traverse the dictionary to obtain the image file names. Every loop iteration does the processing for a single image and saves the feature vectors for that one image in a single file. Before running the image through the network, we perform some preprocessing. The image sizes in the COCO dataset vary from image to image, so we first resize each image so its shortest side is 256 pixels, and then we crop the center 224×224 region of the resulting image. We also normalize the pixel values using the mean and standard deviation documented at pytorch.org.
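
The effect of the preprocessing on an individual image can be worked out by hand. The sketch below computes the resized dimensions and the center-crop box for a hypothetical 640×480 image, and then normalizes a single white pixel with the mean and standard deviation used in this example. The helper names are made up for this illustration, and the rounding is a simplification that may differ by a pixel from what torchvision does internally.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def resized_size(width, height, shortest=256):
    """Scale so the shortest side equals `shortest` pixels."""
    if width <= height:
        return shortest, int(shortest * height / width)
    return int(shortest * width / height), shortest

def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of the center crop."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

def normalize_pixel(rgb):
    """Map 8-bit RGB values to [0, 1], then standardize per channel."""
    return [(v / 255.0 - m) / s for v, m, s in zip(rgb, MEAN, STD)]

w, h = resized_size(640, 480)
print((w, h))
print(center_crop_box(w, h))
print(normalize_pixel((255, 255, 255)))
```

For this landscape image the crop discards roughly a third of the width and an eighth of the height of the resized image, so objects near the left and right edges may not be seen by the network.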

Next we run the image through the network. The output of the network will be a tensor with the shape (1, 512, 14, 14), because PyTorch places the channel dimension before the spatial dimensions. The first dimension indicates that the batch size is 1. Therefore, we extract the first (and only) element from this tensor, move it to the CPU, and convert it to a NumPy array that we save as a gzipped pickle file in the directory feature_vectors. The file has the same name as the image with .pickle.gzip appended. When we have looped through all images, we also save the dictionary as caption_file.pickle.gz so we do not need to parse the JSON file again later in the code that does the actual training.


In [None]:
# Preprocessing pipeline (identical for every image).
# Resize so shortest side is 256 pixels.
# Crop to center 224x224 region.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Run all images through the network and save the output.
for i, item in enumerate(image_dict.values()):
    if i % 1000 == 0:
        print('Progress: ' + str(i) + ' images processed')
    filename = TRAINING_FILE_DIR + 'train2014/' + item[0]

    # Load and preprocess image.
    image = Image.open(filename).convert('RGB')
    input_tensor = preprocess(image)

    # Rearrange array to have one more
    # dimension representing batch size = 1.
    inputs = input_tensor.unsqueeze(0)

    # Call model and save resulting array to disk.
    inputs = inputs.to(device)
    with torch.no_grad():
        y = model(inputs)[0].cpu().numpy()
    save_filename = OUTPUT_FILE_DIR + \
        item[0] + '.pickle.gzip'
    with gzip.open(save_filename, 'wb') as pickle_file:
        pickle.dump(y, pickle_file)

# Save the dictionary containing captions and filenames.
save_filename = OUTPUT_FILE_DIR + 'caption_file.pickle.gz'
with gzip.open(save_filename, 'wb') as pickle_file:
    pickle.dump(image_dict, pickle_file)

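The training code that consumes these files has to reverse the gzip and pickle steps. A minimal sketch of such a round trip is shown below, using a temporary directory and a small made-up dictionary in place of the real caption file. The helper names save_pickle_gz and load_pickle_gz are hypothetical and do not appear in either notebook.

```python
import gzip
import os
import pickle
import tempfile

def save_pickle_gz(obj, path):
    """Pickle obj and write it gzip-compressed to path."""
    with gzip.open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle_gz(path):
    """Read back a gzip-compressed pickle file written as above."""
    with gzip.open(path, 'rb') as f:
        return pickle.load(f)

# Made-up stand-in for image_dict.
toy_dict = {1: ['img_a.jpg', 'A dog on a beach.']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'caption_file.pickle.gz')
    save_pickle_gz(toy_dict, path)
    restored = load_pickle_gz(path)

print(restored == toy_dict)
```

The same pattern should apply to the per-image feature vector files, which hold a NumPy array instead of a dictionary.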