# Competition 3 - Visual Query Answering
Competition 3 combines convolutional and recurrent neural networks.

Here we have a much larger dataset than in the previous two competitions: a reasonable number of images containing spheres, cones or cubes of different colors and materials, and a very large number of questions about the number, the colors or the materials of the objects.

We need to use TensorFlow 2.0 (with Keras) to extract visual features from the images and to encode the questions in the dataset; the final network merges the two neural networks so that it can relate each question to its image and predict the answer.

The final goal is to train the network so that it learns to answer new queries on new images containing the same kinds of objects.

---

We tried different approaches and we present here the 2 main final ones (the output of this notebook, however, is not the same as the one we obtained in the challenge, for computing-performance reasons).

> Remember that this notebook ran on Kaggle: if you download the dataset, change the paths used in the code.

First of all, we import all the needed libraries.

In [2]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow as tf
import numpy as np
import json
import cv2
import math 
from matplotlib import pyplot as plt
from tensorflow.keras.utils import Sequence
import os
from datetime import datetime
import random

We definitely cannot load all the 60k images in a single shot: if we try, TensorFlow raises a memory error because there is not enough free RAM to store them.

The solution is to define a class "DataGenerator" that loads only the images needed at each step of the training process. TensorFlow uses this class to load the data batch by batch.

In [3]:
class DataGenerator(Sequence):
    """Generates data for Keras
    Sequence based data generator. Suitable for building data generator for training and prediction.
    """
    def __init__(self, list_IDs, image_path, train_input_questions, max_length,
                 to_fit=True, batch_size=16, dim=(100, 150),
                 n_channels=3, n_classes=13, shuffle=True):
        """Initialization
        :param list_IDs: list of all 'label' ids to use in the generator  
        = ANSWERS to the questions!
        
        :param image_path: path to images location 
        = IMAGES path!
        
        :param to_fit: True to return X and y, False to return X only 
        = TRUE for validation and train, FALSE for test
        
        :param batch_size: batch size at each iteration 
        
        :param dim: tuple indicating image dimension
        
        :param n_channels: number of image channels 
        = 3 for RGB images
        
        :param n_classes: number of output classes 
        
        :param shuffle: True to shuffle label indexes after every epoch 
        = always true!
        """
        self.list_IDs = list_IDs
        self.train_input_questions = train_input_questions
        self.image_path = image_path
        self.to_fit = to_fit
        self.batch_size = batch_size
        self.dim = dim
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.img_h = dim[0]
        self.img_w = dim[1]
        self.max_length = max_length
        self.on_epoch_end()

    def __len__(self):
        """Denotes the number of batches per epoch
        :return: number of batches per epoch
        """
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        """Generate one batch of data
        :param index: index of the batch
        :return: X and y when fitting. X only when predicting
        """
        # Generate indexes of the batch
        indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]

        # Use the sample positions themselves as IDs: images, questions and
        # labels are parallel lists, so one position selects all three
        list_IDs_temp = indexes

        # Generate data
        X = self._generate_X(list_IDs_temp)

        if self.to_fit:
            y = self._generate_y(list_IDs_temp)
            return X, y
        else:
            return X

    def on_epoch_end(self):
        """Updates indexes after each epoch
        """
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def _generate_X(self, list_IDs_temp):
        """Generates the model inputs for one batch
        :param list_IDs_temp: list of sample ids to load
        :return: [batch of padded questions, batch of images]
        """
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        X2 = np.empty((self.batch_size, self.max_length))

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = self._load_image(self.image_path[ID], self.img_w, self.img_h)
            X2[i,] = (self.train_input_questions[ID]).tolist()
        # The order must match the model inputs: [question_input, image_input]
        return [X2, X]

    def _generate_y(self, list_IDs_temp):
        """Generates the labels for one batch
        :param list_IDs_temp: list of sample ids to load
        :return: batch of labels
        """
        y = np.empty((self.batch_size, 1), dtype=int)

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            y[i] = self.list_IDs[ID]

        return y

    def _load_image(self, image_path, img_w, img_h):
        """Load a color image, resize it and scale it to [0, 1]
        :param image_path: file name of the image to load
        :param img_w: target width
        :param img_h: target height
        :return: loaded image (BGR channel order, as returned by cv2.imread)
        """
        if self.to_fit:
            image = cv2.imread("/kaggle/input/ann-and-dl-vqa/dataset_vqa/train/" + image_path)
        else:
            image = cv2.imread("/kaggle/input/ann-and-dl-vqa/dataset_vqa/test/" + image_path)   
        image = cv2.resize(image, (img_w, img_h))
        image = image/ 255.
        return image
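
To make the batching logic concrete, here is a minimal sketch in plain Python of how a generator of this kind maps a batch index to sample positions. The helper names `num_batches` and `batch_positions` are hypothetical and not part of the notebook; the sketch only mirrors the arithmetic of `__len__` and `__getitem__` above, with toy sizes.

```python
import random


def num_batches(num_samples, batch_size):
    """Number of full batches per epoch; an incomplete last batch is dropped."""
    return num_samples // batch_size


def batch_positions(order, batch_size, index):
    """Sample positions that make up batch number `index`."""
    return order[index * batch_size:(index + 1) * batch_size]


# Toy sizes, chosen only for illustration
num_samples, batch_size = 10, 4

# One epoch: shuffle the positions once, then walk through them batch by batch
order = list(range(num_samples))
random.Random(1234).shuffle(order)

batches = [batch_positions(order, batch_size, i)
           for i in range(num_batches(num_samples, batch_size))]
print(len(batches))  # 2 full batches; the 2 leftover samples are skipped this epoch
```

Because the positions are reshuffled at the end of every epoch, the samples that are skipped differ from one epoch to the next.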

Now we load the training and test data from the JSON files.
We also define 2 functions that convert between an answer string and its class label (for example, "no" is encoded as class 11).

In [4]:
# The "with" statement closes each file automatically
with open('/kaggle/input/ann-and-dl-vqa/dataset_vqa/train_data.json', 'r') as f1:
    data = json.load(f1)

with open('/kaggle/input/ann-and-dl-vqa/dataset_vqa/test_data.json', 'r') as f2:
    data_test = json.load(f2)

#--------------------------------------------------
def get_correct_label(answer):
    return {
        '0': 0,
        '1': 1,
        '10': 2,
        '2': 3,
        '3': 4,
        '4': 5,
        '5': 6,
        '6': 7,
        '7': 8,
        '8': 9,
        '9': 10,
        'no': 11,
        'yes': 12
    }.get(answer)

def get_correct_class(answer):
    return {
        0: 0,
        1: 1,
        2: 10,
        3: 2,
        4: 3,
        5: 4,
        6: 5,
        7: 6,
        8: 7,
        9: 8,
        10: 9,
        11: "no",
        12: "yes"
    }.get(answer)
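
The label order above (with "10" mapped to 2, between "1" and "2") matches the lexicographic order of the answer strings. As a sketch under that assumption, both mappings can be derived instead of typed by hand. The names `label_of` and `answer_of` are hypothetical; unlike `get_correct_class`, this inverse returns every answer as a string.

```python
# The 13 possible answers: the counts 0..10 plus "no" and "yes"
answers = [str(n) for n in range(11)] + ["no", "yes"]

# Sorting the strings lexicographically puts "10" right after "1"
label_of = {answer: index for index, answer in enumerate(sorted(answers))}

# Inverse mapping, from class index back to the answer string
answer_of = {index: answer for answer, index in label_of.items()}

print(label_of["10"], label_of["no"], label_of["yes"])  # 2 11 12
```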

To generate the lists of questions, images and answers we need the number of questions in each JSON file. We also set the seed, the batch size, the number of classes and the image size.

In [5]:
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
img_w = 320
img_h = 240
num_classes = 13
batch_size = 64

Here the validation split is defined (0.1 in the code below; a small fraction is enough because the training set is very large). The lists needed to build the DataGenerators and the tokenizer are defined next.

We also decided to wrap each question with start-of-sentence and end-of-sentence markers (`<sos>` and `<eos>`).

In [6]:
train_questions_number = len(data['questions'])
#train_questions_number = 12000
test_questions_number = len(data_test['questions'])

train_questions = []
train_images = []
train_answers = []

valid_questions = []
valid_images = []
valid_answers = []

test_questions = []
test_images = []
test_ids = []

validation_split = 0.1

# Set boundaries in train and valid datasets
VALID_EXAMPLES = math.trunc(train_questions_number * validation_split)
TRAIN_QUESTIONS = train_questions_number - VALID_EXAMPLES

# Read the data and extract the relative questions, images and answers for training
for i in range(TRAIN_QUESTIONS):
    question = data["questions"][i]["question"]
    question = "<sos> " + question + " <eos>"
    train_questions.append(question)
    image_path = data["questions"][i]["image_filename"]
    #image = cv2.imread("/kaggle/input/ann-and-dl-vqa/dataset_vqa/train/" + image_path)
    #image = cv2.resize(image, (150, 100))
    #image = image/ 255.
    train_images.append(image_path)
    answer = data["questions"][i]["answer"]
    answer = get_correct_label(answer)
    train_answers.append(answer)

# Read the data and extract the relative questions, images and answers for validation
for i in range(TRAIN_QUESTIONS, TRAIN_QUESTIONS + VALID_EXAMPLES):
    question = data["questions"][i]["question"]
    question = "<sos> " + question + " <eos>"
    valid_questions.append(question)
    image_path = data["questions"][i]["image_filename"]
    #image = cv2.imread("/kaggle/input/ann-and-dl-vqa/dataset_vqa/train/" + image_path)
    #image = cv2.resize(image, (150, 100))
    #image = image / 255.
    valid_images.append(image_path)
    answer = data["questions"][i]["answer"]
    answer = get_correct_label(answer)
    valid_answers.append(answer)

# Read the data_test and extract the relative questions and images
for i in range(test_questions_number):
    question = data_test["questions"][i]["question"]
    question = "<sos> " + question + " <eos>"
    test_questions.append(question)
    image_path = data_test["questions"][i]["image_filename"]
    #image = cv2.imread("/kaggle/input/ann-and-dl-vqa/dataset_vqa/test/" + image_path)
    #image = cv2.resize(image, (150, 100))
    #image = image / 255.
    test_images.append(image_path)
    test_id = data_test["questions"][i]["question_id"]
    test_ids.append(test_id)
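
The split above is sequential: the first questions go to training and the last ones to validation, without shuffling. A minimal sketch of the boundary arithmetic, using a made-up total of 1000 questions (the helper names `split_bounds` and `wrap_question` are hypothetical):

```python
import math


def split_bounds(num_questions, validation_split):
    """Return (number of training examples, number of validation examples)."""
    num_valid = math.trunc(num_questions * validation_split)
    return num_questions - num_valid, num_valid


def wrap_question(question):
    """Add the start and end markers used in the notebook."""
    return "<sos> " + question + " <eos>"


# 1000 is an illustrative figure, not the real dataset size
num_train, num_valid = split_bounds(1000, 0.1)
train_range = range(num_train)                         # positions 0..899
valid_range = range(num_train, num_train + num_valid)  # positions 900..999

print(num_train, num_valid)  # 900 100
print(wrap_question("How many cubes are there?"))
```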

Next comes the tokenizer.

It transforms a list of strings into a matrix of integers that can be given as input to our Keras model. The matrix has dimensions A x B, where A is the number of strings and B is the maximum number of words in any string of the list.

For example, given a list composed of "Hi, my name is Ted" and "Hello Ted, my name is Mark", the tokenizer assigns a unique id to each word (the ids below are illustrative: the Keras tokenizer lowercases the text and orders the ids by word frequency):

Hi: 1, my: 2, name: 3, is: 4, Ted: 5, Hello: 6, Mark: 7. After that it rebuilds the strings using the ids:

"1 2 3 4 5" "6 5 2 3 4 7". Finally, the maximum number of words is found (6 here) and shorter sequences are padded with zeros on the left to build the final matrix:

[ [0 1 2 3 4 5], [6 5 2 3 4 7] ]
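
The worked example can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not the Keras implementation: the helpers `build_vocab`, `encode` and `pad_left` are hypothetical, and ids are assigned in order of first appearance to match the example, whereas the Keras `Tokenizer` lowercases words and ranks them by frequency.

```python
import re


def build_vocab(texts):
    """Assign ids 1, 2, 3, ... to words in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in re.findall(r"\w+", text):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(texts, vocab):
    """Replace every known word with its id; unknown words are skipped."""
    return [[vocab[word] for word in re.findall(r"\w+", text) if word in vocab]
            for text in texts]


def pad_left(sequences, max_length):
    """Left-pad every sequence with zeros up to max_length."""
    return [[0] * (max_length - len(seq)) + seq for seq in sequences]


texts = ["Hi, my name is Ted", "Hello Ted, my name is Mark"]
vocab = build_vocab(texts)
sequences = encode(texts, vocab)
max_length = max(len(seq) for seq in sequences)
matrix = pad_left(sequences, max_length)

print(matrix)  # [[0, 1, 2, 3, 4, 5], [6, 5, 2, 3, 4, 7]]
```

Id 0 is never assigned to a word, which is why it can safely be used as the padding value.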

In [7]:
tokenizer = Tokenizer()

# Fit the vocabulary once, on the training questions only, so that every
# split is encoded with the same word -> id mapping (refitting on other
# splits can reorder the ids); words unseen in training are dropped
tokenizer.fit_on_texts(train_questions)

sequences = tokenizer.texts_to_sequences(train_questions)
max_length = max(len(sequence) for sequence in sequences)
train_input_questions = pad_sequences(sequences, maxlen=max_length)

sequences = tokenizer.texts_to_sequences(valid_questions)
valid_input_questions = pad_sequences(sequences, maxlen=max_length)

sequences = tokenizer.texts_to_sequences(test_questions)
test_input_questions = pad_sequences(sequences, maxlen=max_length)

# +1 because id 0 is reserved for padding
words_number = len(tokenizer.word_index) + 1

Here we define the generators! The train and validation generators need no special care, but the test one does: for example, we must set to_fit=False so that it does not try to extract any answers (we don't have them!).

In [8]:
training_generator = DataGenerator(train_answers, train_images, 
                                   train_input_questions, max_length, 
                                   batch_size=batch_size, dim=(img_h, img_w))
validation_generator = DataGenerator(valid_answers, valid_images, 
                                     valid_input_questions, max_length, 
                                     batch_size=batch_size, dim=(img_h, img_w))
test_generator = DataGenerator(test_ids, test_images, 
                               test_input_questions,  max_length, 
                               to_fit=False, batch_size=1, 
                               dim=(img_h, img_w), n_classes=num_classes, shuffle=False)

Let's define our model! To understand the images we use a convolutional network with transfer learning from InceptionResNetV2 (pre-trained on ImageNet, with all but the last 40 layers frozen) and some added dropout layers.

To understand the questions we use a bidirectional GRU recurrent network (which seemed preferable to an LSTM according to the papers we read).

Finally we concatenate the two parts and add the final dense part (again with some dropout layers).

In [9]:
base_model = tf.keras.applications.InceptionResNetV2(input_shape=(img_h, img_w, 3), include_top=False, weights='imagenet')
for i in range(len(base_model.layers) - 40):
    base_model.layers[i].trainable = False

global_average_layer = tf.keras.layers.GlobalAveragePooling2D()

vision_model = tf.keras.models.Sequential()
vision_model.add(tf.keras.layers.Dropout(0.2))
vision_model.add(base_model)
vision_model.add(global_average_layer)
vision_model.add(tf.keras.layers.Dropout(0.5))
vision_model.add(tf.keras.layers.Flatten())

image_input = tf.keras.layers.Input(shape=(img_h, img_w, 3))
encoded_image = vision_model(image_input)

# Define RNN for language input
question_input = tf.keras.layers.Input(shape=[max_length])
embedded_question = tf.keras.layers.Embedding(input_dim=words_number, output_dim=1024, input_length=max_length)(question_input)
encoded_question = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(1024, dropout=0.4, recurrent_dropout=0.2))(embedded_question)

# Combine CNN and RNN to create the final model
merged = tf.keras.layers.concatenate([encoded_question, encoded_image])
output = tf.keras.layers.Dense(1024, activation='relu')(merged)
output = tf.keras.layers.Dropout(0.5)(output)
output = tf.keras.layers.Dense(512, activation='relu')(output)
output = tf.keras.layers.Dropout(0.3)(output)
output = tf.keras.layers.Dense(256, activation='relu')(output)
output = tf.keras.layers.Dropout(0.3)(output)
output = tf.keras.layers.Dense(128, activation='relu')(output)
output = tf.keras.layers.Dropout(0.5)(output)
output = tf.keras.layers.Dense(num_classes, activation='softmax')(output)
vqa_model = tf.keras.models.Model(inputs=[question_input, image_input], outputs=output)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_resnet_v2/inception_resnet_v2_weights_tf_dim_ordering_tf_kernels_notop.h5
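
The merge step can be pictured as plain vector concatenation. The sketch below, in plain Python rather than Keras, tracks only the vector sizes. The 1536 image features are an assumption based on the channel count of InceptionResNetV2's last convolutional block; the 2 x 1024 question features follow from the bidirectional GRU with 1024 units per direction, assuming the default merge mode that concatenates the two directions. The function `fuse` is a hypothetical stand-in for the concatenate layer.

```python
GRU_UNITS = 1024       # units per direction, as in the notebook
IMAGE_FEATURES = 1536  # assumption: channels of InceptionResNetV2's last block


def fuse(question_vector, image_vector):
    """Late fusion by concatenation: question features first, then image features."""
    return list(question_vector) + list(image_vector)


# Dummy feature vectors with the expected sizes
question_vector = [0.0] * (2 * GRU_UNITS)  # forward + backward GRU states
image_vector = [0.0] * IMAGE_FEATURES      # one value per channel after global average pooling

merged = fuse(question_vector, image_vector)
print(len(merged))  # 3584 values feed the first dense layer
```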


We define the optimizer, the loss, the metric and we can fit the model!

In [14]:
optimizer = tf.keras.optimizers.RMSprop()
loss = tf.keras.losses.SparseCategoricalCrossentropy()
metrics = ['accuracy']

vqa_model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

vqa_model.fit_generator(generator=training_generator, validation_data=validation_generator, epochs=1)

  42/3649 [..............................] - ETA: 58:00 - loss: 1.6888 - accuracy: 0.6217

KeyboardInterrupt: 
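
To help read the loss value in the log, here is a minimal sketch of what sparse categorical cross-entropy computes: the mean negative log-probability that the model assigns to the correct class, with labels given as integer indices. It is written in plain Python and leaves out the numerical clipping that Keras applies; the function name simply mirrors the Keras loss and is not the library code.

```python
import math


def sparse_categorical_crossentropy(labels, probabilities):
    """Mean of -log(p[correct class]) over the batch."""
    losses = [-math.log(probs[label])
              for label, probs in zip(labels, probabilities)]
    return sum(losses) / len(losses)


# A model that guesses uniformly over the 13 classes
uniform = [1.0 / 13] * 13
baseline = sparse_categorical_crossentropy([11], [uniform])
print(round(baseline, 3))  # 2.565, i.e. log(13)
```

A loss below log(13), such as the value shown in the log above, therefore means the network is already doing better than a uniform guess.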

Now we predict the answers to the test questions and create the CSV file for the Kaggle submission. (Here the model is not really trained, because of time constraints.)

In [12]:
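def create_csv(results_dir='./'):
    # Relies on the global variables `pred` and `test_ids`

    csv_fname = 'results_'
    csv_fname += datetime.now().strftime('%b%d_%H-%M-%S') + '.csv'

    # Write the file inside results_dir (the current directory by default)
    with open(os.path.join(results_dir, csv_fname), 'w') as f:

        f.write('Id,Category\n')

        for i in range(len(pred)):
            # The submitted category is the index of the most probable class
            f.write(str(test_ids[i]) + ',' + str(np.argmax(pred[i])) + '\n')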
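
As an alternative sketch, the same file layout can be produced with the standard `csv` module and without global variables. The function `write_submission` and the toy ids and probabilities are hypothetical; the output is written to an in-memory buffer so that the example can be run anywhere.

```python
import csv
import io


def argmax(values):
    """Index of the largest value (plain-Python stand-in for np.argmax)."""
    return max(range(len(values)), key=values.__getitem__)


def write_submission(handle, ids, predictions):
    """Write one 'Id,Category' row per test question."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Id", "Category"])
    for sample_id, probabilities in zip(ids, predictions):
        writer.writerow([sample_id, argmax(probabilities)])


# Toy data: two questions and three classes, for illustration only
buffer = io.StringIO()
write_submission(buffer, [7, 8], [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
print(buffer.getvalue())
```

Passing the ids and predictions as arguments makes the function easier to reuse and to test than a version that reads them from the notebook's global state.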

In [None]:
pred = vqa_model.predict_generator(test_generator)
create_csv()