# CS230 Project
# Deep Learning for VQA: Visual Question Answering

Stephanie Do <br> Alona King <br> Jennifer Villa

## Introduction

Our project explores the challenge of visual question answering (VQA): given an image and an open-ended question about that image, build a model that returns a correct answer. The task requires understanding both visual and language inputs and combining the two to produce a natural language answer, which makes it more challenging than traditional image classification. VQA pushes researchers to create networks with a more sophisticated level of understanding, which could ultimately help robots or drones navigate their environment. These networks could also give visually impaired people a richer description of a scene, or support better image or product search within a database.


## Dataset Description
For this project, we will be using the VQA v2.0 dataset. Unlike VQA v1.0, which included both real and abstract scenes, VQA v2.0 contains only real images. The task also differs slightly between versions: v1.0 included both open-ended and multiple-choice question answering, whereas v2.0 focuses exclusively on open-ended question answering.

<br> The VQA 2.0 dataset is built on 82,783 MS COCO training images, 40,504 MS COCO validation images and 81,434 MS COCO testing images. Each image has at least 3 associated questions, for a total of 443,757 questions for training, 214,354 questions for validation and 447,793 questions for testing. Each question comes with 10 ground truth answers, collected from 10 different human respondents who were shown the image-question pair. The dataset also includes a field identifying the most frequent ground truth answer in this set. <br>

<br> Questions are broken into 3 sub-groups, based on their answer types: "yes/no", "number", and "other." The VQA challenge reports model accuracy for each sub-group, as well as an overall number. 
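
As a rough illustration of this per-group reporting, the sketch below averages per-question scores within each answer type and over all questions. The function name `accuracy_by_answer_type`, the input format, and the toy scores are all made up for illustration. Computing the overall number as a plain mean over all questions is an assumption, not something taken from the challenge's evaluation code.

```python
from collections import defaultdict


def accuracy_by_answer_type(results):
    """Average per-question scores by answer type.

    `results` is an iterable of (answer_type, score) pairs, where score is
    a per-question accuracy in [0, 1]. Hypothetical helper for illustration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for answer_type, score in results:
        totals[answer_type] += score
        counts[answer_type] += 1

    # Mean score within each sub-group.
    report = {t: totals[t] / counts[t] for t in totals}
    # Assumption: the overall number is the mean over all questions.
    report["overall"] = sum(totals.values()) / sum(counts.values())
    return report


# Toy scores, not real results.
example = [("yes/no", 1.0), ("yes/no", 0.0), ("number", 1.0), ("other", 0.5)]
report = accuracy_by_answer_type(example)
print(report)
```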

Examples from the VQA v2.0 dataset <br>
Question: What color is the hydrant? <br> <img src="FireHydrant.png"> <br> Answer: Red



Question:  What is hanging above the toilet? <br> <img src="TeddyBear.png"> <br>  Answer: teddy bear

<br> VQA 2.0 also includes a "complementary pairs" dataset. These are pairs of images that share the same question, but the answer to that question is different for each image (see below for an example). Some [research](https://arxiv.org/pdf/1612.00837.pdf) has shown that training with this dataset improves model accuracy and prevents the model from overfitting to the most common answers. As of now, we are not using this dataset, but we may investigate it as an extension.  <img src="PairedImages.png"> 

## Evaluation Metric

The VQA Challenge has set up its own evaluation platform using EvalAI. The metric used for the challenge is <br> <br>
$\text{Acc}(\text{ans}) = \min \left\{ \frac{\text{number of humans that said ans}}{3}, 1 \right\}$
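<br> The metric accounts for the fact that human respondents might give slightly different answers to the same question. When asked "what color is the scarf?", some respondents might say "blue", while others might say "purple." Under this metric, an answer earns full credit if at least 3 of the 10 respondents gave it, and partial credit if only 1 or 2 did.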
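
The formula can be sketched in a few lines of Python. The function name `vqa_accuracy` and the example answers are hypothetical. The sketch assumes exact string matching between answers; the official evaluation may apply additional answer normalization that is not reproduced here.

```python
def vqa_accuracy(answer, human_answers):
    """Return min(number of humans that gave `answer` / 3, 1).

    Assumes exact string matching; hypothetical helper for illustration.
    """
    matches = sum(1 for human_answer in human_answers if human_answer == answer)
    return min(matches / 3.0, 1.0)


# Ten made-up human answers to "what color is the scarf?"
human_answers = ["blue"] * 6 + ["purple"] * 2 + ["navy", "teal"]

print(vqa_accuracy("blue", human_answers))    # 6 matches -> full credit (1.0)
print(vqa_accuracy("purple", human_answers))  # 2 matches -> 2/3
print(vqa_accuracy("navy", human_answers))    # 1 match   -> 1/3
print(vqa_accuracy("green", human_answers))   # 0 matches -> 0.0
```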


## First Steps -- Initial VQA Model

As of now, we have loaded and run the original VQA model described in this [paper](https://arxiv.org/pdf/1505.00468.pdf). The model (which can be found [here](https://github.com/anantzoid/VQA-Keras-Visual-Question-Answering)) is implemented in Keras with a TensorFlow backend. 
<br><br> 
We begin with all necessary import statements. 


In [49]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LSTM, Flatten, Embedding, Merge
from keras.layers.merge import Multiply
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
import h5py

### Image Embedding
Embeddings for the input image are taken from the last hidden layer of VGG19, which outputs a 4096-dimensional vector. Rather than running the images through the VGG19 layers repeatedly, the authors of this network saved the image embeddings and use those as inputs to their network in place of the raw images. This reduces the computational cost of training, but it means that if we decide to change our CNN embedding, we will have to go back to using the raw images as input. 
<br><br>The 4096-element image embedding is then fed to a fully connected layer with 1024 output neurons and a tanh activation function. 

In [50]:
def img_model(dropout_rate):
    print("Creating image model...")
    model = Sequential()
    # Feed the 4096-element image embedding through a fully connected layer
    # with 1024 output neurons and a tanh activation.
    # Note: dropout_rate is currently unused in this function.
    model.add(Dense(1024, input_dim=4096, activation='tanh'))
    return model

### Word Embedding
Using a previously trained embedding matrix, 300-element word2vec representations are created for each word in the question. The sequence of vectors is then fed to an LSTM with 2 hidden layers (with dropout applied). The output of this LSTM is then connected to a dense layer with 1024 output neurons and a tanh activation function. 
<br> **Note:** The VQA paper says that it concatenates the last cell state and the last hidden state from both LSTM layers to form a 2048-dim embedding for the question, which is then fed to the 1024-unit FC layer. However, it is not clear from the code below that such a concatenation is being done. It appears that only the last hidden state output from the 2nd hidden layer is being used. 

In [51]:
def Word2VecModel(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate):
    print("Creating text model...")
    model = Sequential()
    model.add(Embedding(num_words, embedding_dim, 
        weights=[embedding_matrix], input_length=seq_length, trainable=False))
    model.add(LSTM(units=512, return_sequences=True, input_shape=(seq_length, embedding_dim)))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(units=512, return_sequences=False))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1024, activation='tanh'))
    return model
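
To make the dimensions in the note before this code cell concrete, here is a small sketch of the difference between the paper's description and what the Keras code appears to do. It uses placeholder zero vectors in place of real LSTM states. The helper name `paper_style_embedding` and the concatenation order are assumptions made only for illustration.

```python
HIDDEN_UNITS = 512  # units per LSTM layer, as in Word2VecModel
NUM_LAYERS = 2      # number of stacked LSTM layers


def paper_style_embedding(last_hidden_states, last_cell_states):
    """Concatenate the last hidden and cell state of every LSTM layer.

    The ordering [h_1, c_1, h_2, c_2] is an assumption for illustration.
    """
    embedding = []
    for hidden, cell in zip(last_hidden_states, last_cell_states):
        embedding.extend(hidden)
        embedding.extend(cell)
    return embedding


# Placeholder zero vectors stand in for real LSTM states.
hidden_states = [[0.0] * HIDDEN_UNITS for _ in range(NUM_LAYERS)]
cell_states = [[0.0] * HIDDEN_UNITS for _ in range(NUM_LAYERS)]

paper_style = paper_style_embedding(hidden_states, cell_states)
code_style = hidden_states[-1]  # only the last hidden state of layer 2

print(len(paper_style))  # 2 layers * (512 + 512) = 2048
print(len(code_style))   # 512
```

In both cases the vector would then pass through the 1024-unit dense layer; only the input size of that layer differs (2048 versus 512).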

### Bringing it together -- Combining Image and Word Embeddings
Having created two 1024-dimensional embeddings, one for the image and one for the question, the model then merges the two via elementwise multiplication. 
The resulting 1024-element vector is then fed to a fully connected layer with 1000 output neurons and a tanh activation function. From there, it is fed to another fully connected layer with "num_classes" output neurons. "num_classes" represents the number of possible answers; each neuron maps to one answer. A softmax activation is used so that the final output is a probability vector whose elements sum to 1. The value at any particular element of the output vector represents the probability that the corresponding answer is the correct one for a particular image-question pair. The model's answer is the answer with the highest probability in the output vector. 
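
The fusion and prediction steps can be sketched with tiny vectors in plain Python. This is a toy illustration rather than the Keras model: the dense layers between the fusion and the softmax are skipped, and the vectors, logits, answer list, and function names (`fuse`, `softmax`, `predict`) are all made up.

```python
import math


def fuse(image_vec, question_vec):
    """Elementwise multiplication of two equal-length embeddings."""
    return [i * q for i, q in zip(image_vec, question_vec)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, answers):
    """Return the answer with the highest softmax probability."""
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return answers[best_index], probs


# 4-element stand-ins for the 1024-dimensional embeddings.
fused = fuse([0.5, -1.0, 0.2, 0.0], [1.0, 0.5, -0.5, 0.9])

# Hypothetical output-layer scores for three candidate answers.
answer, probs = predict([2.0, 0.5, -1.0], ["red", "yellow", "blue"])
print(fused)
print(answer)
```

Because the fusion is a product, a dimension contributes to the fused vector only when both the image and the question embedding are non-zero there.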
#### Loss function and training
The model uses "categorical_crossentropy" as its loss function. This corresponds to the cross entropy loss defined in class: if $\hat{y}$ is the softmax output giving the predicted probabilities across all $Z$ possible answers, and $y$ is a $Z$-dimensional vector with a '1' at the position of the ground truth answer and '0' in all other positions, then the loss is $-\sum_{z=1}^{Z} y_z \log \hat{y}_z$. Given our specific dataset, the '1' is at the position of the most frequent answer given by the 10 human respondents.
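
As a small worked example of this loss, the sketch below picks the most frequent human answer as the target, one-hot encodes it, and computes the cross entropy against a made-up softmax output. All names and numbers are illustrative. The clipping constant `eps` is an assumption added only to avoid taking the log of zero.

```python
import math
from collections import Counter


def most_frequent_answer(human_answers):
    """Return the answer given by the largest number of respondents."""
    return Counter(human_answers).most_common(1)[0][0]


def one_hot(index, num_classes):
    """Vector with 1.0 at `index` and 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross entropy: -sum_z y_z * log(y_hat_z), with clipping to avoid log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


answers = ["blue", "purple", "red"]               # toy answer vocabulary
human_answers = ["blue"] * 6 + ["purple"] * 4     # ten made-up responses

target = most_frequent_answer(human_answers)      # "blue"
y = one_hot(answers.index(target), len(answers))  # [1.0, 0.0, 0.0]
y_hat = [0.7, 0.2, 0.1]                           # made-up softmax output

loss = categorical_crossentropy(y, y_hat)         # equals -log(0.7)
print(target, loss)
```

With a one-hot target, the sum collapses to the negative log of the probability the model assigned to the ground truth answer.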

The model uses RMSprop as its optimization algorithm, with default hyperparameters (this is suggested in the Keras documentation). The learning rate is 0.001, $\rho$ is 0.9 (this was $\beta$ in our lecture videos; it is the weight given to the historical moving average of squared gradients, with $1-\rho$ given to the current squared gradient), $\epsilon = 1 \times 10^{-8}$ (this is the 'fuzz' factor that protects against divide-by-zero errors), and the learning rate decay is 0. 
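
For reference, a single RMSprop update on one scalar parameter can be sketched as follows, using the hyperparameter values listed above. This is a minimal sketch of the standard update rule, not the Keras implementation. Learning rate decay is omitted because it is 0 here, and the function name `rmsprop_step` is made up.

```python
import math


def rmsprop_step(param, grad, avg_sq_grad, lr=0.001, rho=0.9, epsilon=1e-8):
    """Apply one RMSprop update to a scalar parameter.

    avg_sq_grad is the running (exponentially weighted) average of
    squared gradients carried over from previous steps.
    """
    # rho weights the history, (1 - rho) weights the current squared gradient.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # Scale the step by the root of the running average; epsilon avoids
    # division by zero.
    param = param - lr * grad / (math.sqrt(avg_sq_grad) + epsilon)
    return param, avg_sq_grad


# One step from param = 1.0 with gradient 2.0 and no gradient history.
param, avg_sq_grad = rmsprop_step(param=1.0, grad=2.0, avg_sq_grad=0.0)
print(param, avg_sq_grad)
```

Dividing by the root of the running average means parameters with consistently large gradients take smaller effective steps.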

#### Merge Layer Deprecation
In Keras 2.0, the Merge layer has been deprecated. The suggested fix is to use a Multiply layer to perform the elementwise multiplication, but the Multiply layer only accepts tensors (not Sequential models). We tried several workarounds, including using the new Multiply layer with "vgg_model.output" and "lstm_model.output", but ran into issues because these are "symbolic tensors" rather than actual tensors. It appears that the proper approach would be to switch from Sequential models to functional models, but that in turn breaks the ability to load the pre-trained weights. 

In [66]:
def vqa_model(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate, num_classes):
    vgg_model = img_model(dropout_rate)
    lstm_model = Word2VecModel(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate)
    print("Merging final model...")
    fc_model = Sequential()
    fc_model.add(Merge([vgg_model, lstm_model], mode='mul')) #Merge type layer now deprecated
    fc_model.add(Dropout(dropout_rate))
    fc_model.add(Dense(1000, activation='tanh'))
    fc_model.add(Dropout(dropout_rate))
    fc_model.add(Dense(num_classes, activation='softmax'))
    
    # Set up the loss function and the training (optimization) algorithm
    fc_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
        metrics=['accuracy'])
    return fc_model

## Running the model 
Next we will run one training iteration of the model. Again, we begin with the necessary import statements. 

In [67]:
import numpy as np
from keras.models import model_from_json
from keras.callbacks import ModelCheckpoint
import os
import argparse
# from models import *  # skipped: the necessary functions are already defined above
from prepare_data import *
from constants import *

Next we define the get_model function, which creates the model using our previously defined vqa_model function. Before doing so, it prepares the embedding matrix used to generate the word2vec representations of the question words. It also checks for previously saved weights and loads them if found. 

In [68]:
def get_model(dropout_rate, model_weights_filename):
    print("Creating Model...")
    metadata = get_metadata()
    num_classes = len(metadata['ix_to_ans'].keys())
    num_words = len(metadata['ix_to_word'].keys())

    embedding_matrix = prepare_embeddings(num_words, embedding_dim, metadata)
    model = vqa_model(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate, num_classes)
    if os.path.exists(model_weights_filename):
        print("Loading Weights...")
        model.load_weights(model_weights_filename)
    else:
        print("No weights found at " + model_weights_filename)

    return model

Next we define functions for training and validation.  

In [77]:
def train(epoch, batch_size, data_limit):
    dropout_rate = 0.5
    train_X, train_y = read_data(data_limit)    
    model = get_model(dropout_rate, model_weights_filename)
    checkpointer = ModelCheckpoint(filepath=ckpt_model_weights_filename, verbose=1, monitor='loss', save_best_only=True)
    model.fit(train_X, train_y, epochs=epoch, batch_size=batch_size, callbacks=[checkpointer], shuffle="batch")
    print("Training Complete!")
    #For now, disable final save weights since we've already saved the best model to date
    #model.save_weights(ckpt_model_weights_filename + ".final", overwrite=True)
    
def val():
    val_X, val_y, multi_val_y = get_val_data() 
    model = get_model(0.0, model_weights_filename)
    print("Evaluating Accuracy on validation set:")
    metric_vals = model.evaluate(val_X, val_y)
    print("")
    for metric_name, metric_val in zip(model.metrics_names, metric_vals):
        print(str(metric_name) + " is " + str(metric_val))

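    # Comparing predictions against the multiple choice answers in multi_val_y
    true_positive = 0
    preds = model.predict(val_X)
    # The predicted class is the index with the highest probability
    pred_classes = [np.argmax(pred) for pred in preds]
    for i, pred_class in enumerate(pred_classes):
        if pred_class in multi_val_y[i]:
            true_positive += 1
    # Use the built-in float: the np.float alias was removed in newer NumPy releases
    print("True positive rate: " + str(float(true_positive) / len(pred_classes)))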

Let's start with a test of the validation function to see how well our model does with the weights provided by the model authors. 

In [42]:
val()

Creating Model...
Creating image model...
Creating text model...
Merging final model...
Loading Weights...


  


Evaluating Accuracy on validation set:
  7776/121512 [>.............................] - ETA: 14:38

KeyboardInterrupt: 

Now let's test the training function. For a quick test, let's run only 3 epochs with a batch size of 10, and use only 10 question-image pairs (instead of the full training set). 

In [78]:
train(epoch=3, batch_size=10, data_limit=10)

Reading Data...
Creating Model...
Creating image model...
Creating text model...
Merging final model...
Loading Weights...


  


Epoch 1/3
Epoch 00001: loss improved from inf to 0.84248, saving model to data/ckpts/model_weights.h5
Epoch 2/3
Epoch 00002: loss improved from 0.84248 to 0.32838, saving model to data/ckpts/model_weights.h5
Epoch 3/3
Epoch 00003: loss improved from 0.32838 to 0.01323, saving model to data/ckpts/model_weights.h5
Training Complete!
