# CS230 Project
# Deep Learning for VQA: Visual Question Answering

Stephanie Do <br> Alona King <br> Jennifer Villa

## Introduction

Our project explores the challenge of visual question answering (VQA): given an image and an open-ended question about the image, build a model that returns a correct answer. The task requires understanding both the visual and the language modality and combining the two to produce a natural language answer, which makes it more challenging than traditional image classification. VQA challenges researchers to create networks with a more sophisticated level of understanding that could ultimately be used to help robots or drones navigate their environment. These networks could also give visually impaired people a richer description of a scene, or be used for better image or product search within a database.


## Dataset Description
For this project, we will be using the VQA v2.0 dataset. Unlike VQA v1.0, which included both real and abstract scenes, VQA v2.0 only looks at real images. The task also differs slightly between versions: v1.0 included both open-ended and multiple-choice question answering, whereas v2.0 focuses exclusively on open-ended question answering.

<br> The VQA 2.0 dataset is a collection of 82,783 MS COCO training images, 40,504 MS COCO validation images and 81,434 MS COCO testing images. Each image has 3+ associated questions, for a total of 443,757 questions for training, 214,354 questions for validation and 447,793 questions for testing. Each question is associated with 10 ground truth answers, one from each of 10 different human respondents who were shown the image-question pair. The dataset also includes a field identifying the most frequent ground truth answer of this set. <br>

<br> Questions are broken into 3 sub-groups, based on their answer types: "yes/no", "number", and "other." The VQA challenge reports model accuracy for each sub-group, as well as an overall number. 

Examples from the VQA v2.0 dataset <br>
Question: What color is the hydrant? <br> <img src="FireHydrant.png"> <br> Answer: Red



Question:  What is hanging above the toilet? <br> <img src="TeddyBear.png"> <br>  Answer: teddy bear

<br> VQA 2.0 also includes a "complementary pairs" dataset. These are pairs of images that share the same question, but the answer to that question is different for each image (see below for example). Some [research](https://arxiv.org/pdf/1612.00837.pdf) has shown that training with this dataset improves model accuracy and prevents the model from overfitting to the most common answers. As of now, we are not using this dataset, but we may investigate using it as an extension.  <img src="PairedImages.png"> 

## Evaluation Metric

The VQA Challenge has set up its own evaluation platform using EvalAI. The metric used for the challenge is <br> <br>
$\text{Acc}(ans) = \min \left\{ \frac{\text{number of humans that said } ans}{3},\ 1 \right\}$
<br> The metric accounts for the fact that human respondents might give slightly different answers for a question. When asked "what color is the scarf?", one set of respondents might say "blue", while another set might say "purple." If at least 3 of the 10 respondents give a particular answer, the answer is considered a full credit answer. Otherwise, fractional credit is awarded based on the number of people who gave that answer. 
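
The formula above can be illustrated with a few lines of Python. This is a minimal sketch of the per-answer score only; the function name `vqa_accuracy` and the example responses are made up for illustration, and the answer-string normalization that the official evaluation performs is left out.

```python
from collections import Counter

def vqa_accuracy(predicted, human_answers):
    """Score one prediction against the human answers for a question.

    Implements Acc(ans) = min(number of humans that said ans / 3, 1).
    """
    counts = Counter(human_answers)
    return min(counts[predicted] / 3.0, 1.0)

# Hypothetical responses to "what color is the scarf?"
answers = ["blue"] * 6 + ["purple"] * 2 + ["navy", "teal"]

print(vqa_accuracy("blue", answers))    # full credit: at least 3 humans said "blue"
print(vqa_accuracy("purple", answers))  # partial credit: 2 of the 3 needed
print(vqa_accuracy("green", answers))   # no credit: nobody said "green"
```

An answer given by only two respondents therefore earns two thirds of the credit, which is the fractional case described above.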


## First Steps -- Initial VQA Model

As of now, we have loaded and run the original VQA model described in this [paper](https://arxiv.org/pdf/1505.00468.pdf). The model (which can be found [here](https://github.com/anantzoid/VQA-Keras-Visual-Question-Answering)) is implemented in Keras with a TensorFlow backend.
<br><br> 
We begin with all necessary import statements. 


In [1]:
# Model is needed for the functional-style models defined later in this notebook
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout, LSTM, Flatten, Embedding, Merge, Input, Multiply
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
import h5py

Using TensorFlow backend.


### Image Embedding
Embeddings for the input image are taken from the last hidden layer of VGG19, which outputs a 4096-dimensional vector. Rather than run the images through the VGG19 layers repeatedly, the authors of this network saved the embeddings and use those as inputs to their network instead of the raw images. This reduces the computational cost of training, but it means that when we decide to change our CNN embedding, we will have to go back to the raw images to compute new features.
<br><br>The 4096 element image embedding is then fed to a fully connected layer with 1024 output neurons and a tanh activation function. 

In [2]:
def img_model(dropout_rate):
    print("Creating image model...")
    model = Sequential()
    ##Feed the 4096 element image embedding through fully connected layer with 1024 output neurons and tanh activation
    model.add(Dense(1024, input_dim=4096, activation='tanh'))
    return model

### Word Embedding
Using a pretrained GloVe embedding matrix, a 300-element word vector is looked up for each word in the question. (Despite the `Word2VecModel` function name, the vectors come from GloVe rather than word2vec.) The sequence of vectors is then fed to a 2-layer LSTM, with dropout applied after each layer. The output of this LSTM is then connected to a dense layer with 1024 output neurons and a tanh activation function. Because the embedding layer is instantiated with the trainable parameter set to false, the GloVe embedding matrix weights are not adjusted during training.
<br> **Note:** The VQA paper says that it concatenates the last cell state and the last hidden state from both LSTM layers to form a 2048-dim embedding for the question, which is then fed to the 1024 unit FC layer. However, it is not clear from the code below that such a concatenation is being done. It looks like exclusively the last hidden state output from the 2nd hidden layer is being used. 
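
For comparison, the concatenation described in the paper could be expressed in functional-style Keras roughly as follows. This is a hedged sketch, not code from the referenced repository: the function name `question_encoder_with_states` and the toy sizes are assumptions, dropout and the pretrained embedding weights are omitted for brevity, and it requires a Keras version whose `LSTM` layer supports `return_state=True`.

```python
from keras.layers import Input, Embedding, LSTM, Concatenate, Dense
from keras.models import Model

def question_encoder_with_states(num_words, embedding_dim, seq_length):
    """Hypothetical question encoder that keeps the final states of both LSTM layers."""
    q_in = Input(shape=(seq_length,))
    x = Embedding(num_words, embedding_dim)(q_in)
    # return_state=True also returns the last hidden state and last cell state
    seq, h1, c1 = LSTM(512, return_sequences=True, return_state=True)(x)
    _, h2, c2 = LSTM(512, return_state=True)(seq)
    # 2 layers x (hidden state + cell state) x 512 units = 2048 dimensions
    states = Concatenate(name="lstm_states")([h1, c1, h2, c2])
    q_embedding = Dense(1024, activation="tanh")(states)
    return Model(inputs=q_in, outputs=q_embedding)

# Toy sizes chosen only to build the graph
encoder = question_encoder_with_states(num_words=1000, embedding_dim=300, seq_length=20)
print(encoder.output_shape)
```

The model in the next cell instead passes only the output of the second LSTM layer to the dense layer, which is the difference the note points out.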

In [3]:
def Word2VecModel(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate):
    print("Creating text model...")
    model = Sequential()
    model.add(Embedding(num_words, embedding_dim, 
        weights=[embedding_matrix], input_length=seq_length, trainable=False))
    model.add(LSTM(units=512, return_sequences=True, input_shape=(seq_length, embedding_dim)))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(units=512, return_sequences=False))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1024, activation='tanh'))
    return model

### Bringing it together -- Combining Image and Word Embeddings
Having created two 1024 dimensional embeddings, one for the image and one for the question, the model then merges these two. This merging is done via elementwise multiplication. 
The resulting 1024-element vector is then fed to a fully connected layer with 1000 output neurons and a tanh activation function. From there, it is fed to another fully connected layer with "num_classes" output neurons. "num_classes" represents the number of answers possible for the questions; each neuron maps to one answer. A softmax activation is used to reflect the fact that the final output is a probability vector whose elements sum to 1. The numerical value at any particular element in the output vector represents the probability that the corresponding answer is the correct one for a particular image-question pair. The model's answer is the answer with the maximum probability in the output vector.
#### Loss function and training
The model uses "categorical_crossentropy" as its loss function. This corresponds to the cross entropy metric defined in class; if $\hat{y}$ is the softmax output giving the probability assigned to each of the Z possible answers, then $y$ is a Z-dimensional vector with a '1' at the position of the ground truth answer and '0' in all other positions. In our dataset, the '1' is at the position of the most frequent answer given by the 10 human respondents.
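
With a one-hot target, the cross entropy $L = -\sum_{z=1}^{Z} y_z \log \hat{y}_z$ reduces to the negative log-probability that the model assigns to the ground truth answer. The short sketch below works this out for a toy case; the helper name, the clipping constant `eps`, and the numbers are illustrative assumptions rather than the Keras implementation.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross entropy between a one-hot target and a softmax output."""
    # Clip probabilities away from zero so that log() is always defined
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# Toy case with Z = 4 possible answers; index 2 is the most frequent human answer
y_true = [0, 0, 1, 0]
y_pred = [0.1, 0.2, 0.6, 0.1]

loss = categorical_crossentropy(y_true, y_pred)
print(loss)  # equals -log(0.6), the only term with a non-zero target
```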

The model uses RMSprop as its optimization algorithm with the default hyperparameters, as suggested in the Keras documentation. The learning rate is 0.001; $\rho$ is 0.9 (this was $\beta$ in our lecture videos: the decay rate of the moving average of squared gradients, so the historical average is weighted by $\rho$ and the current squared gradient by $1-\rho$); $\epsilon = 1 \times 10^{-8}$ (the 'fuzz' factor that protects against divide-by-zero errors); and the learning rate decay is 0.
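
The role of each hyperparameter is easiest to see in a single update step. The NumPy sketch below follows the standard RMSprop rule with the values quoted above; it is an illustration only, not Keras's internal code, and `rmsprop_step` and the toy parameter values are made up for the example.

```python
import numpy as np

def rmsprop_step(param, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    """Apply one RMSprop update and return the new parameter and running average."""
    # Moving average of squared gradients: rho weights history, (1 - rho) the new gradient
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Divide the step by the root of the average; eps guards against division by zero
    param = param - lr * grad / (np.sqrt(avg_sq) + eps)
    return param, avg_sq

w = np.array([1.0, -2.0])   # toy parameters
g = np.array([0.5, -0.5])   # toy gradient
w_new, avg = rmsprop_step(w, g, avg_sq=np.zeros_like(w))
print(w_new, avg)
```

Because each gradient is divided by the root of its own running average, parameters with very different gradient magnitudes end up taking steps of similar size.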



In [4]:
def vqa_model(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate, num_classes):
    vgg_model = img_model(dropout_rate)
    lstm_model = Word2VecModel(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate)
    print("Merging final model...")
    fc_model = Sequential()
    fc_model.add(Merge([vgg_model, lstm_model], mode='mul')) #Merge type layer now deprecated
    fc_model.add(Dropout(dropout_rate))
    fc_model.add(Dense(1000, activation='tanh'))
    fc_model.add(Dropout(dropout_rate))
    fc_model.add(Dense(num_classes, activation='softmax'))
    
    #Setup loss function and defining training algorithm
    fc_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
        metrics=['accuracy'])
    return fc_model

## Testing the Model
To test our model, we ran one training iteration of the model. Again, we begin with the necessary import statements. 

In [9]:
import numpy as np
from keras.models import model_from_json
from keras.callbacks import ModelCheckpoint
import os
import argparse
from prepare_data import *
from constants import *

Next we define the `get_model` function, which creates the model using our previously defined `vqa_model` function. Before that, it prepares the embedding matrix used to look up the word vector for each question word. This function also checks for previously saved weights and loads them if found.

In [10]:
def get_model(dropout_rate, model_weights_filename):
    print("Creating Model...")
    metadata = get_metadata()
    num_classes = len(metadata['ix_to_ans'].keys())
    num_words = len(metadata['ix_to_word'].keys())

    embedding_matrix = prepare_embeddings(num_words, embedding_dim, metadata)
    model = vqa_model(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate, num_classes)
    if os.path.exists(model_weights_filename):
        print("Loading Weights...")
        model.load_weights(model_weights_filename)
    else:
        print("No weights found at " + model_weights_filename)

    return model
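
The `prepare_embeddings` helper is imported from `prepare_data` and is not shown in this notebook. As a rough idea of what building such a matrix involves, here is a hedged sketch: `build_embedding_matrix`, its arguments, and the toy vectors are hypothetical, and it assumes that every word index is a valid row number of the matrix.

```python
import numpy as np

def build_embedding_matrix(num_words, embedding_dim, ix_to_word, word_vectors):
    """Hypothetical helper: row i holds the pretrained vector of the word with index i."""
    matrix = np.zeros((num_words, embedding_dim))
    for ix, word in ix_to_word.items():
        vector = word_vectors.get(word)
        if vector is not None:      # words without a pretrained vector stay all-zero
            matrix[int(ix)] = vector
    return matrix

# Toy data: 3-dimensional vectors stand in for the 300-dimensional GloVe vectors
ix_to_word = {0: "what", 1: "color", 2: "hydrant"}
word_vectors = {"what": [0.1, 0.2, 0.3], "color": [0.4, 0.5, 0.6]}

embedding_matrix = build_embedding_matrix(3, 3, ix_to_word, word_vectors)
print(embedding_matrix.shape)
```

A matrix of this shape is what the `Embedding` layer receives through its `weights` argument.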

Next we define functions for training and validation.  

In [11]:
def train(epoch, batch_size, data_limit):
    dropout_rate = 0.5
    train_X, train_y = read_data(data_limit)    
    model = get_model(dropout_rate, model_weights_filename)
    checkpointer = ModelCheckpoint(filepath=ckpt_model_weights_filename, verbose=1, monitor='loss', save_best_only=True)
    model.fit(train_X, train_y, epochs=epoch, batch_size=batch_size, callbacks=[checkpointer], shuffle="batch")
    print("Training Complete!")
    model.save_weights(ckpt_model_weights_filename + ".final", overwrite=True)
    return model  # returned so that callers such as loop() can save or evaluate it
    
def val():
    val_X, val_y, multi_val_y = get_val_data() 
    model = get_model(0.0, model_weights_filename)
    print("Evaluating Accuracy on validation set:")
    metric_vals = model.evaluate(val_X, val_y)
    print("")
    for metric_name, metric_val in zip(model.metrics_names, metric_vals):
        print(str(metric_name) + " is " + str(metric_val))

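    # Comparing prediction against the multiple human-provided answers
    true_positive = 0
    preds = model.predict(val_X)
    pred_classes = [np.argmax(pred) for pred in preds]
    for i, pred_class in enumerate(pred_classes):
        if pred_class in multi_val_y[i]:
            true_positive += 1
    # np.float was removed from recent NumPy releases; the builtin float is equivalent
    true_positive_rate = float(true_positive) / len(pred_classes)
    print("True positive rate: " + str(true_positive_rate))
    # Returned so that callers such as loop() can log the results
    return list(zip(model.metrics_names, metric_vals)), true_positive_rate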

Let's start with a test of the validation function to see how well our model does with the weights provided by the model authors. After running the line of code below to test the model on the entire validation set, the following results were obtained:
<br><br>
**Loss:** 2.76330976921 <br>
**Accuracy:** 0.460777536375
<br><br>
This is close to the 45.03% validation accuracy reported in the original author's documentation.

In [None]:
val()

Now let's test the training function. For a quick test, let's run only 1 epoch with a batch size of 10, using only 10 question-image pairs (instead of the full training set).

In [None]:
train(epoch=1, batch_size=10, data_limit=10)

## Migrating to a Functional-Style Keras Model
### Merge Layer Deprecation
When we ran the initial model, we saw warnings that as of Keras 2.0, the Merge layer used in `vqa_model()` has been deprecated. The suggested fix would be to instead use a Multiply layer to perform the elementwise multiplication, but the Multiply layer will only accept tensors. Several workarounds were tried, including using the new Multiply layer with "vgg_model.output" and "lstm_model.output", but we ran into issues there because we were using "symbolic tensors" rather than actual tensors. 

The proper approach, recommended by the Keras development community, is to switch from sequential-style models, heavily used in Keras 1.0, to functional models, the de facto recommendation for Keras 2.0. Hence, we have redefined our models in functional Keras syntax.

In [15]:
def img_model_func(dropout_rate):
    print("Creating functional image model...")
    input_img = Input((4096,))
    img_embedding = Dense(1024, input_dim=4096, activation='tanh')(input_img)
    return input_img, img_embedding

In [16]:
def Word2VecModel_func(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate):
    print("Creating functional text model...")
    input_q = Input((seq_length,))
    x = Embedding(num_words, embedding_dim, weights=[embedding_matrix], trainable=False)(input_q)
    x = LSTM(units=512, return_sequences=True)(x)
    x = Dropout(dropout_rate)(x)
    x = LSTM(units=512, return_sequences=False)(x)
    x = Dropout(dropout_rate)(x)
    q_embedding = Dense(1024, activation='tanh')(x)
    return input_q, q_embedding

In [17]:
def vqa_model_func(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate, num_classes):
    # Use the functional builders, which return (input tensor, embedding tensor)
    input_img, img_embedding = img_model_func(dropout_rate)
    input_q, q_embedding = Word2VecModel_func(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate)
    print("Merging final model...")
    combined = Multiply()([img_embedding, q_embedding])
    combined = Dropout(dropout_rate)(combined)
    combined = Dense(1000, activation='tanh')(combined)
    combined = Dropout(dropout_rate)(combined)
    predictions = Dense(num_classes, activation='softmax')(combined)
    fc_model = Model(inputs=[input_img, input_q], outputs=predictions)
    fc_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
    return fc_model

Refactoring to functional style models breaks the ability to load the pre-trained weights, trained on the sequential model definition.

Therefore, to validate our functional model implementation, we trained the model from scratch overnight and recorded evaluation results against the validation set every 10 epochs. We wrote a function `loop(args)` to print a heartbeat as training continued.


In [None]:
def loop(args):
    for i in range(1, args.num_loops + 1):
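        # train() above takes (epoch, batch_size, data_limit); args.data_limit is an assumed field
        model = train(args.epoch, args.batch_size, args.data_limit)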
        if args.save_all:
            model.save_weights(model_weights_filename+"_epoch_"+str(i*args.epoch), overwrite=False)
        metrics, true_positive_rate = val()
        with open("training_log", "a") as val_log:
            val_log.write("After training epoch " + str(args.epoch * i)+"\n")
            for name, value in metrics:
                val_log.write(name + " " + str(value)+"\n")
            val_log.write("True_positive_rate: " + str(true_positive_rate)+"\n")
        print("Finished loop number: ", i)

Upon running the `loop` function, our results were as follows: <br>

| Epoch | Accuracy | True Positive Rate | Loss |
| :---: | :--- | :--- | :--- |
| 10 | 0.445552702614 | 0.573844558562 | 2.53969596104 |
| 20 | 0.454802817829 | 0.58548949898 | 2.59171895231 |
| 30 | 0.45152742116 | 0.582181183751 | 2.6666429342 |
| 40 | 0.458144051616 | 0.587061360195 | 2.76133057947 |
| 50 | 0.453848179604 | 0.581045493449 | 2.9734828312 |
| 60 | 0.449074988479 | 0.577926459938 | 3.08288818925 |
| 70 | 0.450976035289 | 0.577959378498 | 3.19416170537 |
| 80 | 0.449881493186 | 0.577910000658 | 3.32517784668 |
| 90 | 0.445272894858 | 0.572231549147 | 3.3803069139 |
| 100 | 0.44325663309 | 0.571375666601 | 3.44215128993 |

Here, **Accuracy** measures the rate at which the predicted answer is the same as the top human answer.

**True Positive Rate** measures the rate at which the predicted answer matches any one of the 10 human provided answers.

**Loss** measures the categorical cross-entropy on the validation set.
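
The difference between the first two metrics can be seen on a toy example. The sketch below is illustrative only: the function name and the answer indices are made up, and each question is given a small set of human answers rather than the full ten.

```python
def accuracy_and_true_positive_rate(pred_classes, top_answers, all_answers):
    """Compare predictions with the top human answer and with any human answer."""
    n = float(len(pred_classes))
    exact = sum(p == t for p, t in zip(pred_classes, top_answers))
    any_match = sum(p in answers for p, answers in zip(pred_classes, all_answers))
    return exact / n, any_match / n

# Hypothetical answer indices for four image-question pairs
pred_classes = [3, 7, 1, 5]
top_answers = [3, 2, 1, 0]
all_answers = [{3, 4}, {2, 7}, {1}, {0, 9}]

acc, tpr = accuracy_and_true_positive_rate(pred_classes, top_answers, all_answers)
print(acc, tpr)  # 0.5 0.75
```

The second prediction counts toward the true positive rate but not toward accuracy, because it matches a human answer that is not the most frequent one. This is why the true positive rate in the table is always the higher of the two numbers.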

After training for 30 epochs, our functional model achieved a validation accuracy of 45.15%, which is almost identical to the 45.03% accuracy reported by the original implementers of the sequential model. We believe the difference is caused by using different batch sizes.

Our model achieved its highest accuracy after 40 epochs, at 45.81%. After 40 epochs, both the accuracy and the true positive rate decline, indicating that the model is overfitting. The validation loss, which rises at every checkpoint after epoch 10, points the same way.
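
Choosing which checkpoint to keep can be done mechanically from the logged values. The snippet below is a small sketch that copies the accuracy and loss columns from the table above and picks the best row under each criterion; the variable names are our own.

```python
# (epoch, validation accuracy, validation loss) copied from the results table
results = [
    (10, 0.445552702614, 2.53969596104),
    (20, 0.454802817829, 2.59171895231),
    (30, 0.45152742116, 2.6666429342),
    (40, 0.458144051616, 2.76133057947),
    (50, 0.453848179604, 2.9734828312),
    (60, 0.449074988479, 3.08288818925),
    (70, 0.450976035289, 3.19416170537),
    (80, 0.449881493186, 3.32517784668),
    (90, 0.445272894858, 3.3803069139),
    (100, 0.44325663309, 3.44215128993),
]

best_by_accuracy = max(results, key=lambda row: row[1])
best_by_loss = min(results, key=lambda row: row[2])

print("Highest validation accuracy at epoch", best_by_accuracy[0])  # 40
print("Lowest validation loss at epoch", best_by_loss[0])           # 10
```

The two criteria disagree here, so the choice of checkpoint depends on whether accuracy or loss is treated as the quantity to optimize.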

## Next Steps
1. **Test different image embeddings**: Currently, our model uses image embeddings taken from the last hidden layer of VGG19. We would like to test the performance of embeddings taken from other well-known CNN models. Keras has [open source implementations](https://github.com/fchollet/keras/tree/master/keras/applications) of Resnet-50, Inception-Resnet_v2, Inception_v3, and Xception. A search of Github also reveals a Keras [Resnet-152](https://gist.github.com/flyyufelix/7e2eafb149f72f4d38dd661882c554a6) implementation, which might also be worth trying since this was the [model](https://web.stanford.edu/class/cs224n/reports/2748290.pdf) used by another Stanford team. As with our existing model, it probably makes sense to pre-compute the image features by running the VQA dataset images through the chosen network for embedding. Using our new set of image features, we can re-train our model to see if we can achieve better accuracy. 
2. **Implement New Cost Function for Soft Cross Entropy Loss**: Currently our model uses categorical cross-entropy loss, with the ground-truth answer being a one-hot vector encoding the most frequent answer given by the human respondents polled. However, the VQA metric actually awards partial credit to models that output answers that match any of the ten human respondent answers. Thus, there is currently a disconnect between the loss function and the evaluation metric. One of the top performers in this year's VQA challenge sought to address this by proposing a [soft cross entropy loss](https://ilija139.github.io/pub/cvpr2017_vqa.pdf) function. This function calculates a weighted average over all unique ground truth answers given by the 10 human respondents. In the paper, these researchers achieved ~1.2-1.6% improved accuracy across a variety of model architectures. We would like to see if this improvement translates to our model as well.
3. **Implement Multi-Modal Factorized Bilinear Pooling**: The [second place winner](https://arxiv.org/pdf/1708.01471.pdf) in this year’s VQA challenge proposed the concept of multi-modal factorized bilinear pooling. The idea is that using concatenation or elementwise multiplication to combine image and question embeddings (each of which represents information from a different modality, hence the term “multi-modal”) may limit model performance. A more sophisticated approach to fusing these two might be necessary. Thus they proposed “multi-modal factorized bilinear pooling”, which amounts to a combination of element-wise multiplication, fully connected layers, and sum pooling. We would like to see the impact of substituting this technique in place of our current “Multiply” layer in our VQA model. Once we implement co-attention (see item 4), we could also incorporate this technique into that model. (The authors of the original paper saw improvement both in simpler model architectures, such as the one we currently have, and in those with co-attention.)
4. **Implement Image-Question Co-attention**: All of the top performing VQA models use some form of attention. While early research focused exclusively on image attention [[1]](https://arxiv.org/abs/1511.07394), more recent work has combined image attention with question attention [[2]](https://arxiv.org/abs/1606.00061) [[3]](https://web.stanford.edu/class/cs224n/reports/2748290.pdf). The philosophy behind the latter approach is that, just as certain regions of the image are more relevant than others, certain words in the question are more helpful in answering than others. A variety of different co-attention architectures have been proposed, including hierarchical co-attention, which looks at question attention recursively on word, phrase, and full question levels. Other co-attention models (like [3]) include information indicating the part of speech of a word. We plan on starting with an architecture like the one proposed in the "Multi-Modal Factorized Bilinear Pooling" paper, which won second place [[4]](https://arxiv.org/pdf/1708.01471.pdf). The experimenters provide their [source code](https://github.com/yuzcccc/vqa-mfb), but it is written in Caffe.
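
To make item 2 concrete, here is a hedged sketch of what a soft target and the matching loss might look like. It assumes that each unique human answer is weighted by its share of the responses, which is our reading of "weighted average" and may differ from the exact formulation in the linked paper; the function names and the toy vocabulary are hypothetical.

```python
import math
from collections import Counter

def soft_targets(human_answers, answer_to_ix, num_classes):
    """Assumed weighting: each unique answer receives its share of the human votes."""
    target = [0.0] * num_classes
    # Ignore answers outside the model's answer vocabulary
    known = [a for a in human_answers if a in answer_to_ix]
    for answer, count in Counter(known).items():
        target[answer_to_ix[answer]] = count / float(len(known))
    return target

def soft_cross_entropy(target, y_pred, eps=1e-7):
    """Cross entropy against a soft (not one-hot) target distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, y_pred))

answer_to_ix = {"blue": 0, "purple": 1, "red": 2}
humans = ["blue"] * 7 + ["purple"] * 3
target = soft_targets(humans, answer_to_ix, num_classes=3)
y_pred = [0.6, 0.3, 0.1]   # toy softmax output

print(target)  # [0.7, 0.3, 0.0]
print(soft_cross_entropy(target, y_pred))
```

Unlike the one-hot target, this target still rewards the model for putting probability on "purple", which is closer to how the evaluation metric awards partial credit.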
