# Visual Question Answering

In this notebook I use two different model architectures to tackle the visual question answering (VQA) problem.

The images are processed with VGG19 (stripped of its final softmax layer). The result is a vector of 4096 values per image.

The questions are embedded with GloVe, which yields a vector of 300 values for each word. Every question is padded to a fixed length of 50 tokens.
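
The fixed-length question representation can be sketched with NumPy alone. The helper `pad_question` below is hypothetical (it is not part of this notebook) and uses random vectors in place of real GloVe embeddings; it only illustrates zero-padding short questions and truncating long ones to 50 tokens.

```python
import numpy as np

QUESTION_MAX_LENGTH = 50   # fixed number of tokens per question
WORD_FEATURES_SIZE = 300   # size of each word vector


def pad_question(token_vectors, max_length=QUESTION_MAX_LENGTH):
    """Zero-pad (or truncate) a list of word vectors to a (max_length, 300) matrix."""
    matrix = np.zeros((max_length, WORD_FEATURES_SIZE))
    for i, vector in enumerate(token_vectors[:max_length]):
        matrix[i, :] = vector
    return matrix


# Stand-in for the embeddings of a 6-token question (random, not real GloVe vectors)
rng = np.random.default_rng(0)
fake_tokens = [rng.normal(size=WORD_FEATURES_SIZE) for _ in range(6)]

question_matrix = pad_question(fake_tokens)
print(question_matrix.shape)                   # (50, 300)
print(np.count_nonzero(question_matrix[6:]))   # 0: rows after the last token are padding
```

Truncating in the helper keeps questions longer than 50 tokens from overflowing the matrix.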

The first model uses LSTM layers to encode the question; a fully connected (FC) network then takes the image vector and the LSTM output as input.

The second model uses two FC networks. The first takes the flattened question matrix ((50, 300) -> 15000) and reduces it to a more manageable number of values. The second takes the resulting vector, concatenated with the image representation, and makes the predictions.
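
The shapes involved in the second architecture can be checked with a small NumPy sketch. The arrays are zero-filled placeholders, and the 512-value size of the reduced question vector is an assumption made for this example (the actual sizes are set per experiment further down).

```python
import numpy as np

batch_size = 4
questions = np.zeros((batch_size, 50, 300))   # padded question matrices
images = np.zeros((batch_size, 4096))         # VGG19 feature vectors

# Flatten each (50, 300) question matrix into a single vector of 15000 values
flat_questions = questions.reshape(batch_size, 50 * 300)
print(flat_questions.shape)   # (4, 15000)

# Stand-in for the first FC network: an untrained linear map down to 512 values
# (512 is an arbitrary size chosen for this example)
fc1_weights = np.zeros((50 * 300, 512))
reduced = flat_questions @ fc1_weights

# Concatenate the reduced question vector with the image vector;
# this joint vector is what the second FC network receives
joint = np.concatenate([reduced, images], axis=1)
print(joint.shape)            # (4, 4608)
```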

Due to personal time constraints I was not able to tune the hyperparameters properly, and I had to keep the batch size and the number of steps per epoch low. I was also unable to try other approaches.

The results are similar for the two architectures: the maximum test accuracy was 0.25 for both.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import os
import json
from datetime import datetime
import random

from PIL import Image
import spacy

import tensorflow as tf
import numpy as np

from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import *

SEED = 1234
tf.random.set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

cwd = os.getcwd()

# Parameters

In [3]:
height = 224
width = 224
shape = (height, width, 3)

bs = 16 # Batch size

question_max_length = 50

num_classes = 13

In [4]:
train_file = "dataset_vqa/train_data.json"
train_images_dir = "dataset_vqa/train/"

test_file = "dataset_vqa/test_data.json"
test_images_dir = "dataset_vqa/test/"

In [5]:
validation_split = 0.1

with open(train_file) as f:
    data = json.load(f)
    
indices = list(range(len(data["questions"])))
random.shuffle(indices)

train_length = int(len(data["questions"])*(1-validation_split))

train_indices = indices[:train_length]
validation_indices = indices[train_length:]

steps_per_epoch = int(len(train_indices) / bs)
validation_steps = int(len(validation_indices) / bs)

# Auxiliary models

In [6]:
# Image model for extracting a vector of features
# VGG19

image_model = tf.keras.applications.VGG19(input_shape = shape,
                                          include_top = True,
                                          weights = 'imagenet')
image_model = Model(image_model.input, image_model.layers[-2].output)
image_model.trainable = False
image_model.summary()

image_features_size = image_model.layers[-1].output_shape[1]

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 224, 224, 3)]     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0     

In [7]:
def get_images_features(image_model, filenames):
    """
    Returns the vectors of features of images given an image model and the filenames list.
    
    Parameters:
        image_model (keras model): pre-trained keras model that given an image outputs a vector of features (e.g. VGG19 without last layer).
        filenames ([str]): list of paths to the image files.
        
    Returns:
        a tensor of feature vectors (one per image).
    """
    
    images = []
    for fn in filenames:
        img = Image.open(fn)
            
        img = img.resize((width, height))  # PIL expects (width, height)
        img = np.array(img)[:,:,:3]
        images.append(img)
    
    images = tf.convert_to_tensor(images)
    return tf.convert_to_tensor(image_model.predict(images))

In [8]:
# Word embedding model
# GloVe

word_model = spacy.load("en_core_web_lg")
word_features_size = 300

In [9]:
def get_questions_features(word_model, questions):
    """
    Returns the vectors of features of questions.
    
    Parameters:
        word_model: word embedding model (e.g. "en_core_web_lg").
        questions ([str]): list of questions as strings.
        
    Returns:
        a tensor of shape (len(questions), question_max_length, word_features_size) holding the zero-padded GloVe vectors.
    """
    
    questions_features = np.zeros((len(questions), question_max_length, word_features_size))
    
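    for q, question in enumerate(questions):
        tokens = word_model(question)
        # Keep at most question_max_length tokens so longer questions do not overflow the matrix
        for i, token in enumerate(tokens[:question_max_length]):
            questions_features[q, i, :] = token.vector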
        
    return tf.convert_to_tensor(questions_features)

# Dataset loading

In [10]:
encode_dict = {
    '0': 0,
    '1': 1,
    '10': 2,
    '2': 3,
    '3': 4,
    '4': 5,
    '5': 6,
    '6': 7,
    '7': 8,
    '8': 9,
    '9': 10,
    'no': 11,
    'yes': 12
}

def encode(y):
    y = encode_dict[y]
    return tf.one_hot(y, num_classes)
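
When a human-readable answer is needed, the class index produced by `np.argmax` on a prediction can be mapped back to its answer string by inverting `encode_dict`. The names `decode_dict` and `decode` below are hypothetical and are not used elsewhere in this notebook; the mapping is copied from the cell above so that the sketch runs on its own.

```python
# Same label mapping as encode_dict in the cell above
encode_dict = {
    '0': 0, '1': 1, '10': 2, '2': 3, '3': 4, '4': 5, '5': 6,
    '6': 7, '7': 8, '8': 9, '9': 10, 'no': 11, 'yes': 12
}

# Inverse mapping: class index -> answer string
decode_dict = {index: answer for answer, index in encode_dict.items()}


def decode(class_index):
    """Return the answer string for a predicted class index."""
    return decode_dict[int(class_index)]


print(decode(2))    # 10
print(decode(12))   # yes
```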

In [11]:
def generator(indices, batch_size = bs, flatten_questions = False):
    """
    A dataset generator.
    
    Parameters:
        indices ([int]): list of indices of the set as found in the train_data.json file.
        batch_size (int): batch size.
        flatten_questions (bool): optionally the question matrix of each question can be flattened into a single vector (default False). 
    
    Returns:
        the dataset generator.
    """
    
    # Load the questions once instead of re-reading the JSON file for every batch
    with open(train_file) as f:
        data = json.load(f)
        data = data["questions"]
    
    while True:
        
        # Sample a batch of indices (with replacement)
        batch_indices = np.random.choice(a = indices, size = batch_size)
        
        questions = []
        images_filenames = []
        answers = []
        
        for i in batch_indices:
            q = data[i]
            questions.append(q["question"])
            images_filenames.append(train_images_dir + q["image_filename"])
            answers.append(encode(q["answer"]))
        
        images_features = get_images_features(image_model, images_filenames)
        questions_features = get_questions_features(word_model, questions)
        
        if flatten_questions:
            questions_features = tf.reshape(questions_features, [len(batch_indices), question_max_length*word_features_size])
        
        batch_x = [questions_features, images_features]
        batch_y = np.array(answers)
        
        yield (batch_x, batch_y)

In [12]:
train_generator = generator(train_indices, bs)
validation_generator = generator(validation_indices, bs)

train_generator_alt = generator(train_indices, bs, flatten_questions = True)
validation_generator_alt = generator(validation_indices, bs, flatten_questions = True)

# Model and training

In [13]:
def get_model(LSTM_units, FC_units, FC_dropout=None):
    """
    Returns a Keras model given some parameters.
    
    Parameters:
        LSTM_units ([int]): number of LSTM units in each layer.
        FC_units ([int]): number of units in each FC layer.
        FC_dropout (float): dropout for the FC layers (default None)
        
    Returns:
        a Keras model.
    """
    
    # Image
    in_image = Input((image_features_size,))
    
    in_language = Input((question_max_length, word_features_size))
    model_language = in_language
    for i in range(len(LSTM_units)):
        if i < len(LSTM_units) - 1:
            model_language = LSTM(LSTM_units[i], return_sequences = True)(model_language)
        else:
            model_language = LSTM(LSTM_units[i], return_sequences = False)(model_language)
    
    model = Concatenate()([model_language, in_image])
    
    for i in range(len(FC_units)):
        model = Dense(FC_units[i], activation = 'relu')(model)
        if FC_dropout:
            model = Dropout(FC_dropout)(model)
    
    model = Dense(num_classes, activation = 'softmax')(model)
    
    model = Model([in_language, in_image], model)

    return model

In [14]:
def get_model_alt(FC1_units, FC2_units,
                  FC1_dropout = None, FC1_activation = 'relu', 
                  FC2_dropout = None, FC2_activation = 'relu'):
    """
    Returns a Keras model given some parameters.
    
    Parameters:
        FC1_units ([int]): number of units in each FC_1 layer (question processing).
        FC2_units ([int]): number of units in each FC_2 layer (question + image processing).
        FC1_dropout (float): dropout for the FC_1 layers (default None).
        FC1_activation (str): name of the activation function used in the FC_1 layers.
        FC2_dropout (float): dropout for the FC_2 layers (default None).
        FC2_activation (str): name of the activation function used in the FC_2 layers. 
        
    Returns:
        a Keras model.
    """
    
    
    in_image = Input((image_features_size,))
    in_language = Input((question_max_length * word_features_size,))
    
    model_language = in_language
    for i in range(len(FC1_units)):
        model_language = Dense(FC1_units[i], activation = FC1_activation)(model_language)
        if FC1_dropout:
            model_language = Dropout(FC1_dropout)(model_language)
            
    model = Concatenate()([model_language, in_image])
    
    for i in range(len(FC2_units)):
        model = Dense(FC2_units[i], activation = FC2_activation)(model)
        if FC2_dropout:
            model = Dropout(FC2_dropout)(model)
            
    model = Dense(num_classes, activation = 'softmax')(model)
    
    model = Model([in_language, in_image], model)
    
    return model

In [15]:
def fit_model(model, model_name = None, train_gen = train_generator, validation_gen = validation_generator):
    """
    Function used to fit the model (and save the checkpoints).
    It saves a checkpoint every time the validation loss improves and returns the model with the best weights.
    The performance evaluated is the validation loss.
    Early stopping is used with 10 epochs of patience.
    It also uses a tensorboard callback for visualization.
    
    Parameters
    ----------
    model: keras model
        model to fit.
    model_name: string, optional
        name of the model (defaults to the current timestamp).
    train_gen: generator, optional
        generator of training batches.
    validation_gen: generator, optional
        generator of validation batches.
    
    Returns
    -------
    keras model: the best model.
    string: the directory of the model.
    """
    
    cwd = os.getcwd()
    
    # General experiments folder
    exps_dir = os.path.join(cwd, 'vqa_experiments')
    if not os.path.exists(exps_dir):
        os.makedirs(exps_dir)
    
    now = datetime.now().strftime('%b%d_%H-%M-%S')
    
    # A default argument would be evaluated only once, at definition time,
    # so the fallback name is computed here instead
    if model_name is None:
        model_name = now
    
    # This experiment folder
    exp_dir = os.path.join(exps_dir, model_name + '_' + str(now))
    if not os.path.exists(exp_dir):
        os.makedirs(exp_dir)
    
    # Checkpoints folder
    ckpt_dir = os.path.join(exp_dir, 'ckpts')
    if not os.path.exists(ckpt_dir):
        os.makedirs(ckpt_dir)
    
    # Tensorboard folder
    tb_dir = os.path.join(exp_dir, 'tb_logs')
    if not os.path.exists(tb_dir):
        os.makedirs(tb_dir)
    
    # Checkpoints callback, best one will be the last saved
    ckpt_callback = tf.keras.callbacks.ModelCheckpoint(filepath=os.path.join(ckpt_dir, 'cp_{epoch:02d}.ckpt'), 
                                                       save_weights_only=True,
                                                       save_best_only=True)
    
    # Tensorboard callback
    tb_callback = tf.keras.callbacks.TensorBoard(log_dir=tb_dir,
                                                 profile_batch=0,
                                                 histogram_freq=1)  # if 1 shows weights histograms
    
    # Early stopping callback
    es_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    
    callbacks= [ckpt_callback, tb_callback, es_callback]
    
    model.fit_generator(generator=train_gen,
                        epochs=50,
                        steps_per_epoch=100,
                        validation_data=validation_gen,
                        validation_steps=10,
                        callbacks=callbacks,
                        use_multiprocessing=False,
                        workers=1)
    
    # Load best model (last one saved)
    latest = tf.train.latest_checkpoint(ckpt_dir)
    print("Latest model: " + latest)
    # latest_checkpoint already returns the full checkpoint path prefix
    model.load_weights(latest)
    
    return (model, exp_dir)

In [16]:
def create_csv(results, results_dir='./'):
    """
    Function used to write a prediction dictionary to a csv file
    
    Parameters
    ----------
    results: dict
        predictions
    results_dir: string, optional
        the directory
    """

    csv_fname = 'results_'
    csv_fname += datetime.now().strftime('%b%d_%H-%M-%S') + '.csv'

    with open(os.path.join(results_dir, csv_fname), 'w') as f:

        f.write('Id,Category\n')

        for key, value in results.items():
            f.write(str(key) + ',' + str(value) + '\n')

In [17]:
def predict(model, exp_dir, flatten_questions = False):
    """
    Function used to make predictions and write them to file.
    
    Parameters
    ----------
    model: keras model
    exp_dir: the directory where the predictions must be saved.
    """
    
    results = {}
    
    print("Evaluating the test set...")
    
    with open(test_file) as f:
        data = json.load(f)
        data = data["questions"]
        
    num_questions = len(data)
    
    for i in range(num_questions):
        q = data[i]
        question = [q["question"]]
        image_filename = [test_images_dir + q["image_filename"]]
        
        question_features = get_questions_features(word_model, question)
        if flatten_questions:
            question_features = tf.reshape(question_features, [1, question_max_length*word_features_size])
            
        image_features = get_images_features(image_model, image_filename)
        
        prediction = model.predict([question_features, image_features])
        prediction = np.argmax(prediction)
        
        question_id = q["question_id"]
        
        progress = (question_id / num_questions) * 100
        
        if progress % 10 == 0:
            print("Progress: " + str(int(progress)) + "%")
        
        results[question_id] = prediction
    
    print("DONE")
    
    create_csv(results = results, results_dir=exp_dir)

In [18]:
def run_model(LSTM_units, FC_units, FC_dropout = None):
    """
    Function used to group all the operations needed to create a LSTM+FC model, compile it, fit it and make predictions.
    """
    
    model_name = "LSTM_u="
    for u in LSTM_units:
        model_name += str(u) + "_"
    model_name += "FC_u="
    for u in FC_units:
        model_name += str(u) + "_"
    if FC_dropout:
        model_name += "FC_do" + str(FC_dropout).replace("0.", "p")
        
    model = get_model(LSTM_units = LSTM_units, 
                      FC_units = FC_units, 
                      FC_dropout = FC_dropout)
    
    loss = tf.keras.losses.CategoricalCrossentropy()
    lr = 1e-2
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    metrics = ['accuracy']
    
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

    model.summary()
    
    model, m_dir = fit_model(model, model_name)
    
    predict(model, m_dir)

In [19]:
def run_model_alt(FC1_units, FC2_units, 
                  FC1_dropout = None, FC1_activation = 'relu', 
                  FC2_dropout = None, FC2_activation = 'relu'):
    """
    Function used to group all the operations needed to create a FC+FC model, compile it, fit it and make predictions.
    """
    
    model_name = "FC1_u="
    for u in FC1_units:
        model_name += str(u) + "_"
    if FC1_dropout:
        model_name += "do=" + str(FC1_dropout).replace("0.", "p") + "_"
    #model_name += "_a=" + FC1_activation
    model_name += "FC2_u="
    for u in FC2_units:
        model_name += str(u) + "_"
    if FC2_dropout:
        model_name += "do=" + str(FC2_dropout).replace("0.", "p") + "_"
    #model_name += "_a=" + FC2_activation
    
    model = get_model_alt(FC1_units, FC2_units,
                          FC1_dropout, FC1_activation, 
                          FC2_dropout, FC2_activation)
        
    loss = tf.keras.losses.CategoricalCrossentropy()
    lr = 1e-2
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    metrics = ['accuracy']
    
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
    
    model.summary()
    
    model, m_dir = fit_model(model, model_name, train_gen = train_generator_alt, validation_gen = validation_generator_alt)
    
    predict(model, m_dir, flatten_questions = True)

# Models

# Approach 1

The image is processed with VGG19 to obtain a vector of 4096 values.

The question is turned into a sequence of 50 GloVe vectors (one per token, plus zero padding).

The question is then fed to a stack of LSTM layers. The output of the last LSTM layer is concatenated with the image vector, and the resulting vector is fed to a stack of FC layers.
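
As a sanity check on the size of the LSTM stack, the parameter counts reported by `model.summary()` in this section can be reproduced by hand. The helper `lstm_params` is hypothetical and assumes a standard LSTM layer with bias, as created by the default Keras `LSTM` arguments.

```python
def lstm_params(input_size, units):
    """Trainable parameters of a standard LSTM layer with bias.

    Each of the 4 gates has an input kernel (input_size x units),
    a recurrent kernel (units x units) and a bias vector (units).
    """
    return 4 * (units * (input_size + units) + units)


# First LSTM layer: 300-dimensional word vectors in, 512 units
print(lstm_params(300, 512))   # 1665024

# Second and third LSTM layers: 512 values in, 512 units
print(lstm_params(512, 512))   # 2099200
```

Both values match the `Param #` column of the summaries for `[512, 512, 512]` LSTM units.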

In [24]:
LSTM_units = [512, 512, 512]
FC_units = [512, 512, 512]

run_model(LSTM_units = LSTM_units,
          FC_units = FC_units)


In [20]:
LSTM_units = [512, 512, 512]
FC_units = [512, 512, 512]
FC_dropout = 0.3

run_model(LSTM_units = LSTM_units, 
          FC_units = FC_units, 
          FC_dropout = FC_dropout)

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            [(None, 50, 300)]    0                                            
__________________________________________________________________________________________________
lstm_3 (LSTM)                   (None, 50, 512)      1665024     input_5[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LSTM)                   (None, 50, 512)      2099200     lstm_3[0][0]                     
__________________________________________________________________________________________________
lstm_5 (LSTM)                   (None, 512)          2099200     lstm_4[0][0]                     
____________________________________________________________________________________________

In [18]:
LSTM_units = [512, 512, 512]
FC_units = [512, 512, 512]
FC_dropout = 0.5

run_model(LSTM_units = LSTM_units,
          FC_units = FC_units,
          FC_dropout = FC_dropout)

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 50, 300)]    0                                            
__________________________________________________________________________________________________
lstm (LSTM)                     (None, 50, 512)      1665024     input_3[0][0]                    
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 50, 512)      2099200     lstm[0][0]                       
__________________________________________________________________________________________________
lstm_2 (LSTM)                   (None, 512)          2099200     lstm_1[0][0]                     
____________________________________________________________________________________________

# Approach 2

The image is processed with VGG19 to obtain a vector of 4096 values.

The question is turned into a sequence of 50 GloVe vectors (one per token, plus zero padding) and then flattened; the result is a vector of 15000 values.

The flattened question is then fed to some FC layers (FC1). The output of the last FC1 layer is concatenated with the image vector, and the resulting vector is fed to some other FC layers (FC2).
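
Flattening the question makes the first FC1 layer very large. The arithmetic below reproduces the `Param #` values shown in the summaries of this section; `dense_params` is a hypothetical helper that assumes a plain `Dense` layer with bias.

```python
def dense_params(input_size, units):
    """Trainable parameters of a fully connected layer with bias."""
    return input_size * units + units


flattened_question_size = 50 * 300   # 15000 values per question

# First FC1 layer for the three configurations tried in this section
print(dense_params(flattened_question_size, 512))    # 7680512
print(dense_params(flattened_question_size, 1024))   # 15361024
print(dense_params(flattened_question_size, 4096))   # 61444096

# Second FC1 layer of the [1024, 512] configuration
print(dense_params(1024, 512))                       # 524800
```

For comparison, the three LSTM layers of Approach 1 hold about 5.9 million parameters in total, so a single 4096-unit FC1 layer is roughly ten times larger.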

In [31]:
FC1_units = [512]
FC2_units = [512, 512]
FC1_dropout = 0.3
FC2_dropout = 0.3

run_model_alt(FC1_units = FC1_units, FC2_units = FC2_units,
              FC1_dropout = FC1_dropout, FC2_dropout = FC2_dropout)

Model: "model_6"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_15 (InputLayer)           [(None, 15000)]      0                                            
__________________________________________________________________________________________________
dense_21 (Dense)                (None, 512)          7680512     input_15[0][0]                   
__________________________________________________________________________________________________
dropout_15 (Dropout)            (None, 512)          0           dense_21[0][0]                   
__________________________________________________________________________________________________
input_14 (InputLayer)           [(None, 4096)]       0                                            
____________________________________________________________________________________________

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x22f95e53348>

Evaluating the test set...
Progress: 0%
Progress: 10%
Progress: 20%
Progress: 30%
Progress: 40%
Progress: 50%
Progress: 60%
Progress: 70%
Progress: 80%
Progress: 90%
DONE


In [23]:
FC1_units = [1024, 512]
FC2_units = [1024, 512]
FC1_dropout = 0.5
FC2_dropout = 0.3

run_model_alt(FC1_units = FC1_units, FC2_units = FC2_units,
              FC1_dropout = FC1_dropout, FC2_dropout = FC2_dropout)

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 15000)]      0                                            
__________________________________________________________________________________________________
dense (Dense)                   (None, 1024)         15361024    input_3[0][0]                    
__________________________________________________________________________________________________
dropout (Dropout)               (None, 1024)         0           dense[0][0]                      
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 512)          524800      dropout[0][0]                    
____________________________________________________________________________________________

In [20]:
FC1_units = [4096]
FC2_units = [512, 512]
FC1_dropout = 0.5
FC2_dropout = 0.3

run_model_alt(FC1_units = FC1_units, FC2_units = FC2_units,
              FC1_dropout = FC1_dropout, FC2_dropout = FC2_dropout)

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 15000)]      0                                            
__________________________________________________________________________________________________
dense (Dense)                   (None, 4096)         61444096    input_3[0][0]                    
__________________________________________________________________________________________________
dropout (Dropout)               (None, 4096)         0           dense[0][0]                      
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 4096)]       0                                            
____________________________________________________________________________________________