# CS230 Project
# Deep Learning for VQA: Visual Question Answering

Stephanie Do <br> Alona King <br> Jennifer Villa

## Introduction

Our project explores the challenge of visual question answering (VQA): given an image and an open-ended question about that image, build a model that returns a correct answer. This task requires synthesizing both visual and language modalities and combining the two to produce a natural language answer, making it more challenging than traditional image classification. VQA challenges researchers to create networks with a more sophisticated level of understanding that could ultimately be used to help robots or drones navigate their environment. These networks could also give visually impaired people a richer description of a scene, or be used for better image or product search within a database.


## Dataset Description
For this project, we will be using the VQA v2.0 dataset. Unlike VQA v1.0, which included both real and abstract scenes, VQA v2.0 only looks at real images. The task also differs slightly between versions: v1.0 included both open-ended and multiple-choice question answering, whereas v2.0 focuses exclusively on open-ended question answering.

<br> The VQA 2.0 dataset is a collection of 82,783 MS COCO training images, 40,504 MS COCO validation images and 81,434 MS COCO testing images. Each image has at least 3 associated questions, for a total of 443,757 questions for training, 214,354 questions for validation and 447,793 questions for testing. Each question is associated with 10 ground truth answers, collected from 10 different human respondents who were each shown the image-question pair. The dataset also includes a field identifying the most frequent ground truth answer in this set. <br>
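As an illustration of how the "most frequent ground truth answer" field relates to the 10 human answers, here is a minimal sketch. The function name `most_frequent_answer` and the sample answers are hypothetical and are not taken from the dataset files; how the dataset itself breaks ties is not specified here, so the tie-breaking below is an assumption.

```python
from collections import Counter

def most_frequent_answer(answers):
    """Return the answer string given by the largest number of respondents.

    Assumption: ties are broken in favor of the answer seen first, which is
    how Counter.most_common orders equal counts in Python 3.7+.
    """
    return Counter(answers).most_common(1)[0][0]

# Hypothetical set of 10 human answers to "What color is the hydrant?"
human_answers = ["red"] * 8 + ["dark red", "maroon"]
print(most_frequent_answer(human_answers))
```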

Examples from the VQA v2.0 dataset <br>
Question: What color is the hydrant? <br> <img src="FireHydrant.png"> <br> Answer: Red



Question:  What is hanging above the toilet? <br> <img src="TeddyBear.png"> <br>  Answer: teddy bear

<br> VQA 2.0 also includes a "complementary pairs" dataset. These are pairs of images that share the same question, but the answer to that question is different for each image (see below for an example). Some [research](https://arxiv.org/pdf/1612.00837.pdf) has shown that training with this dataset improves model accuracy and prevents the model from overfitting to the most common answers. As of now, we are not using this dataset, but we may investigate using it as an extension. <img src="PairedImages.png"> 

## Evaluation Metric

The VQA Challenge has set up its own evaluation platform using EvalAI. The metric used for the challenge is <br> <br>
$\text{Acc}(ans) = \min \left\{ \frac{\text{number of humans that said } ans}{3}, 1 \right\}$
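<br> The metric accounts for the fact that human respondents might give slightly different answers to a question. When asked "what color is the scarf?", one set of respondents might say "blue", while another set might say "purple." Under this metric, an answer receives full credit if at least 3 of the 10 respondents gave it, and partial credit if only 1 or 2 did.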
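The metric can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the challenge's evaluation code: the function name `vqa_accuracy` and the sample answers are hypothetical, and exact string matching is an assumption (any answer normalization applied by the evaluation server is not reproduced here).

```python
def vqa_accuracy(predicted, human_answers):
    """Acc(ans) = min(number of humans that said ans / 3, 1).

    Assumption: answers are compared by exact string equality.
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)

# Hypothetical answers from 10 respondents to "what color is the scarf?"
human_answers = ["blue"] * 6 + ["purple"] * 3 + ["navy"]

for candidate in ["blue", "purple", "navy", "green"]:
    print(candidate, vqa_accuracy(candidate, human_answers))
```

In this example both "blue" and "purple" receive full credit because at least three respondents gave each answer, while "navy" receives one third and "green" receives zero.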


## Progress

As of now, we have loaded and run the original VQA model described in this [paper](https://arxiv.org/pdf/1505.00468.pdf). The model (which can be found [here](https://github.com/anantzoid/VQA-Keras-Visual-Question-Answering)) is implemented in Keras with a TensorFlow backend.
<br><br> The embedding for each input image is the 4096-dimensional output of the last hidden layer of VGG19. Rather than run the images through the VGG19 layers repeatedly, the authors of this network saved the image embeddings and use those as inputs to their network in place of the raw images. This reduces the computational cost of the network, but it means that if we decide to change our CNN embedding, we will have to go back to using the raw images as input.
<br><br> The 4096-element image embedding is then fed to a fully connected layer with 1024 output neurons.
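To make the shapes concrete, here is a minimal NumPy sketch of a fully connected layer mapping a 4096-element embedding to 1024 outputs. It is not the Keras code from the repository: the random weights stand in for learned parameters, the random inputs stand in for saved VGG19 embeddings, and the `tanh` activation is an assumption, since the text above does not say which nonlinearity is used.

```python
import numpy as np

IMAGE_DIM = 4096   # size of the saved VGG19 embedding
HIDDEN_DIM = 1024  # number of output neurons in the fully connected layer

rng = np.random.default_rng(0)
# Placeholder parameters; in the real model these are learned during training.
W = rng.normal(scale=0.01, size=(IMAGE_DIM, HIDDEN_DIM))
b = np.zeros(HIDDEN_DIM)

def project_image(embedding):
    """Apply the fully connected layer: activation(x @ W + b).

    Assumption: tanh is used as the activation for illustration only.
    """
    return np.tanh(embedding @ W + b)

# A hypothetical batch of 2 precomputed image embeddings.
batch = rng.normal(size=(2, IMAGE_DIM))
projected = project_image(batch)
print(projected.shape)
```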
<br><br> The embeddings for the question are generated by multiplying a word2vec embedding matrix with the one-hot vector of each word in the question, which is equivalent to looking up that word's row in the matrix.
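The following toy sketch shows why multiplying a one-hot word vector by an embedding matrix gives the same result as a row lookup. The vocabulary, the 2-dimensional vectors, and the function names are all hypothetical; real word2vec vectors have far more dimensions, and the actual model relies on Keras rather than hand-written code like this.

```python
# Hypothetical toy vocabulary and embedding matrix (one row per word).
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "hydrant": 4}
embedding_matrix = [
    [0.1, 0.3],
    [0.7, -0.2],
    [0.0, 0.5],
    [-0.4, 0.1],
    [0.9, 0.8],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_product(word):
    """Embed a word as (one-hot row vector) x (embedding matrix)."""
    v = one_hot(vocab[word], len(vocab))
    dims = len(embedding_matrix[0])
    return [sum(v[i] * embedding_matrix[i][j] for i in range(len(v)))
            for j in range(dims)]

def embed_by_lookup(word):
    """Embed a word by reading its row directly from the matrix."""
    return embedding_matrix[vocab[word]]

question = "what color is the hydrant".split()
embedded = [embed_by_lookup(word) for word in question]
print(len(embedded), "word vectors of size", len(embedded[0]))
```

Because the two functions agree, embedding layers are usually implemented as a lookup, which avoids building the one-hot vectors at all.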


```python
import tensorflow as tf
```

