# Visual Question Answering Dataset

VQA Homepage: http://visualqa.org/download.html

Annotations taken from [Training annotations 2017 v2.0](http://visualqa.org/data/mscoco/vqa/v2_Annotations_Train_mscoco.zip)

Questions taken from [Training questions 2017 v2.0](http://visualqa.org/data/mscoco/vqa/v2_Questions_Train_mscoco.zip)

![title](img/vqa_examples.jpg)

In [1]:
import json
import zipfile
import random
import numpy as np
from collections import Counter, defaultdict
from time import time

In [2]:
with zipfile.ZipFile('data/v2_Questions_Train_mscoco.zip', 'r') as file:
    qdata = json.load(file.open(file.namelist()[0]))

with zipfile.ZipFile('data/v2_Annotations_Train_mscoco.zip', 'r') as file:
    adata = json.load(file.open(file.namelist()[0])) 

### Preprocessing

* Spelling correction (using Bing Speller) of question and answer strings
* Question normalization (first char uppercase, last char ‘?’)
* Answer normalization (all chars lowercase, no period except as decimal point, number words -> digits, strip articles (a, an, the))
* Adding apostrophe if a contraction is missing it (e.g., convert "dont" to "don't")
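
The normalization rules above can be sketched roughly as follows. This is an illustrative approximation, not the code used to build the dataset: the `NUMBER_WORDS` table and both function names are assumptions, and spelling correction and contraction repair are left out.

```python
import re

# Hypothetical lookup table; the mapping used for the real dataset is not given here.
NUMBER_WORDS = {
    'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
    'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'ten': '10',
}
ARTICLES = {'a', 'an', 'the'}


def normalize_answer(answer):
    """Lowercase, drop non-decimal periods, map number words to digits, strip articles."""
    text = answer.lower().strip()
    # Keep a period only when it sits between two digits (a decimal point).
    text = re.sub(r'(?<!\d)\.|\.(?!\d)', '', text)
    words = [NUMBER_WORDS.get(w, w) for w in text.split() if w not in ARTICLES]
    return ' '.join(words)


def normalize_question(question):
    """Uppercase the first character and make sure the last one is '?'."""
    text = question.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if not text.endswith('?'):
        text += '?'
    return text


print(normalize_answer('The Two Dogs.'))
print(normalize_question('what color is the shirt'))
```

Note that a one-word answer consisting only of an article (e.g. the letter "a") would be reduced to an empty string by this sketch, so a real pipeline would need to special-case it.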

## Data Exploration

### Annotation Data

In [3]:
print("# Datapoints: ", len(adata['annotations']))
print("Datapoint keys: ", adata['annotations'][0].keys())

# Datapoints:  443757
Datapoint keys:  dict_keys(['question_type', 'multiple_choice_answer', 'answers', 'image_id', 'answer_type', 'question_id'])


Let's look at some datapoints:

In [4]:
print("#1: ", adata['annotations'][0])
print("\n#2: ", adata['annotations'][1])
print("\n#3: ", adata['annotations'][2])

#1:  {'question_type': 'what is this', 'multiple_choice_answer': 'net', 'answers': [{'answer': 'net', 'answer_confidence': 'maybe', 'answer_id': 1}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 2}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 3}, {'answer': 'netting', 'answer_confidence': 'yes', 'answer_id': 4}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 5}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 6}, {'answer': 'mesh', 'answer_confidence': 'maybe', 'answer_id': 7}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 8}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 9}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 10}], 'image_id': 458752, 'answer_type': 'other', 'question_id': 458752000}

#2:  {'question_type': 'what', 'multiple_choice_answer': 'pitcher', 'answers': [{'answer': 'pitcher', 'answer_confidence': 'yes', 'answer_id': 1}, {'answer': 'catcher', 'answer_confidence': 'no', 'answer_

### Question Data

In [5]:
print("# Datapoints: ", len(qdata['questions']))
print("\nDatapoint keys: ", qdata['questions'][0].keys())

# Datapoints:  443757

Datapoint keys:  dict_keys(['image_id', 'question', 'question_id'])


Let's look at some datapoints:

In [6]:
print("#1: ", qdata['questions'][0])
print("\n#2: ", qdata['questions'][1])
print("\n#3: ", qdata['questions'][2])

#1:  {'image_id': 458752, 'question': 'What is this photo taken looking through?', 'question_id': 458752000}

#2:  {'image_id': 458752, 'question': 'What position is this man playing?', 'question_id': 458752001}

#3:  {'image_id': 458752, 'question': 'What color is the players shirt?', 'question_id': 458752002}
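
Both files carry a `question_id` field, so questions and annotations can be matched on it instead of relying on the two lists being in the same order. A minimal sketch on toy records that mimic the fields printed above (the helper name `pair_by_question_id` is an assumption, and the toy annotations keep only two fields):

```python
def pair_by_question_id(questions, annotations):
    """Return (question, annotation) pairs matched on 'question_id'."""
    ann_by_id = {ann['question_id']: ann for ann in annotations}
    return [(q, ann_by_id[q['question_id']]) for q in questions]


# Toy records; the annotations are deliberately listed in a different order.
toy_questions = [
    {'image_id': 458752, 'question': 'What is this photo taken looking through?', 'question_id': 458752000},
    {'image_id': 458752, 'question': 'What position is this man playing?', 'question_id': 458752001},
]
toy_annotations = [
    {'question_id': 458752001, 'multiple_choice_answer': 'pitcher'},
    {'question_id': 458752000, 'multiple_choice_answer': 'net'},
]

pairs = pair_by_question_id(toy_questions, toy_annotations)
for question, annotation in pairs:
    print(question['question'], '->', annotation['multiple_choice_answer'])
```

A missing `question_id` raises a `KeyError` here, which surfaces a mismatch between the two files immediately.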


### Dataset Statistics

In [7]:
question_types = set()
multiple_choice_answers = set()
answer2count = defaultdict(int)
answer_types = set()
answertypes2count = defaultdict(int)
top_answers_per_type = defaultdict(lambda: defaultdict(int))
for ann in adata['annotations']:
    question_types.add(ann['question_type'])
    
    multiple_choice_answers.add(ann['multiple_choice_answer'])
    
    answer2count[ann['multiple_choice_answer']] += 1
    answer_types.add(ann['answer_type'])
    
    answertypes2count[ann['answer_type']] += 1
    top_answers_per_type[ann['answer_type']][ann['multiple_choice_answer']] += 1

#### Question Types

In [8]:
print("# Unique Question Types: ", len(question_types))
print(question_types)

# Unique Question Types:  65
{'is that a', 'is this', 'does this', 'is', 'what color is', 'is the person', 'what does the', 'what is the name', 'what sport is', 'what animal is', 'how many people are in', 'what number is', 'what is', 'is this an', 'what is in the', 'was', 'what color is the', 'what is the man', 'are these', 'what room is', 'is it', 'what', 'is the man', 'is there a', 'is he', 'what is on the', 'can you', 'what kind of', 'is this a', 'none of the above', 'why is the', 'what brand', 'where is the', 'are there any', 'which', 'what is the color of the', 'what color are the', 'what is the woman', 'who is', 'do', 'are the', 'what type of', 'what color', 'what are the', 'what time', 'is this person', 'are there', 'are', 'what is the person', 'what is this', 'how many people are', 'is the', 'where are the', 'could', 'has', 'how many', 'why', 'is there', 'what are', 'are they', 'do you', 'does the', 'is the woman', 'how', 'what is the'}


#### Answer Types

In [9]:
print("Answer Types: ", answer_types)
print("Answer Type Counts: ", Counter(answertypes2count).most_common())
for t in list(answer_types):
    print("\nType '%s' Top 50 Answers %s" %(t, Counter(top_answers_per_type[t]).most_common(50)))

Answer Types:  {'other', 'number', 'yes/no'}
Answer Type Counts:  [('other', 219269), ('yes/no', 166882), ('number', 57606)]

Type 'other' Top 50 Answers [('white', 8915), ('blue', 5455), ('red', 5201), ('black', 5066), ('brown', 3814), ('green', 3750), ('yellow', 2792), ('gray', 2113), ('nothing', 1814), ('right', 1760), ('frisbee', 1641), ('baseball', 1597), ('left', 1563), ('none', 1562), ('tennis', 1502), ('wood', 1449), ('orange', 1425), ('bathroom', 1230), ('pizza', 1203), ('pink', 1201), ('kitchen', 1093), ('cat', 933), ('dog', 890), ('water', 888), ('man', 885), ('skateboarding', 884), ('grass', 879), ('skiing', 866), ('kite', 793), ('silver', 773), ('black and white', 766), ('surfing', 762), ('horse', 708), ('living room', 702), ('skateboard', 701), ('phone', 697), ('snow', 641), ('wii', 636), ('giraffe', 636), ('woman', 632), ('standing', 627), ('surfboard', 622), ('eating', 607), ('cake', 601), ('food', 599), ('apple', 586), ('sunny', 584), ('broccoli', 572), ('table', 564),

#### Answers

In [10]:
print("# Unique Answers: ", len(multiple_choice_answers))
print("\nSome Answers: ", list(np.random.choice(list(multiple_choice_answers), 100)))
print("\nTop 100 Common Answers: ", Counter(answer2count).most_common(100))

# Unique Answers:  22531

Some Answers:  ['congress', 'colonials', 'dragonair', 'african american', 'cigarette', 'comic', 'chiquita and del monte', 'tilted', '3 days', 'cosmic ln', 'tennis clothes', 'emmanuel n photo', 'supply', 'mon-sat 8am-6pm', '007', 'love seat', 'medical', 'posing for photo', 'because they slaughter them for meat', 'syrup', 'changes in traffic', 'circus', 'green bay', 'airplanes', '488', 'taos', '2:14', 'coke and water', 'v', 'paddle', 'sheep and goat', '350', 'instruments', '05:04', 'building sandcastle', 'white, blue, and red', 'riding', 'on bed', 'housecat', 'roman', 'chicken, broccoli, pasta', 'taking stretch', 'spt', 'pillowcase', '617-497-4111', 'burlap', 'a place to stand', 'casino', '1890', 'crochet', 'no ball', 'tusk holes', 'eric berne', 'cake sale', 'chip wagon', 'stay back', '2 brunette', 'near city', 'at beach', 'under mom', 'independent', 'to play', 'boy on right', 'cleanliness', '1st base', 'wwwclaykessackcom', 'uphill', 'apple identification', 'sid

## Dataset Creation

The subset follows the same structure as the original VQA dataset:

* Answer
    * Question Type
    * Majority Answer
    * Answer Type
    * Answer Candidates
        * Given Answer
        * Confidence
        * Answerer ID
        
        
* Question
    * Question
    * Image ID
   
   
* Images
    * ResNet Image Features (Size: 2048)
    

To make it feasible to train models on a single machine, with or without a GPU, we need to reduce the size of the dataset. We reduce the original dataset to:
* 20k Q/A of answer type _yes/no_
* 20k Q/A of answer type _number_
* 20k Q/A of answer type _other_

The total number of Q/A pairs will then be 60,000. We divide them into training, validation and test splits with ratios of approximately 80%, 15% and 5%, respectively. The ratios are only approximate because each pair is assigned to a split independently at random.

In [11]:
start_time = time()
idx = list(range(0,len(qdata['questions'])))
random.seed(42)
random.shuffle(idx)

np.random.seed(42)
splits = ['train', 'valid', 'test']

n = 20000
qdata_small = {'questions': list()}
adata_small = {'annotations': list()}
a_type_counts = {'yes/no': 0, 'number': 0, 'other': 0}

while len(qdata_small['questions']) < 3*n:
    i = idx.pop()
    
    at = adata['annotations'][i]['answer_type'] 
    
    if a_type_counts[at] < n:
        
        if at == 'yes/no' and adata['annotations'][i]['multiple_choice_answer'] not in ['yes', 'no']:
            continue
            
        adata_small['annotations'].append(adata['annotations'][i])
        qdata_small['questions'].append(qdata['questions'][i])
        
        split = np.random.choice(splits, p=(.8, .15, .05))
        adata_small['annotations'][-1]['split'] = split
        qdata_small['questions'][-1]['split'] = split
        
        a_type_counts[at] += 1
        
# Tests
assert len(qdata_small['questions']) == len(adata_small['annotations']) == 3*n, "Inconsistent Lengths."
a_type_counts = {'yes/no': 0, 'number': 0, 'other': 0}
for ann in adata_small['annotations']:
    a_type_counts[ann['answer_type']] += 1
assert a_type_counts['yes/no'] == a_type_counts['number'] == a_type_counts['other'] == n, "Inconsistent Answer Type Lengths."

print("Data Creation Looks good! Time Taken %.2f" %(time()-start_time))

Data Creation Looks good! Time Taken 2.33


Let's look at some examples to verify that the subset contains the same kind of data as the original, and then calculate the statistics again.

#### Annotations Small Dataset

In [12]:
print("# Datapoints: ", len(adata_small['annotations']))
print("\nDatapoint keys: ", adata_small['annotations'][0].keys())
print("\n#1: ", adata_small['annotations'][0])
print("\n#2: ", adata_small['annotations'][1])
print("\n#3: ", adata_small['annotations'][2])

# Datapoints:  60000

Datapoint keys:  dict_keys(['question_type', 'multiple_choice_answer', 'answers', 'image_id', 'answer_type', 'question_id', 'split'])

#1:  {'question_type': 'what', 'multiple_choice_answer': 'tea', 'answers': [{'answer': 'brunch', 'answer_confidence': 'maybe', 'answer_id': 1}, {'answer': 'tea', 'answer_confidence': 'yes', 'answer_id': 2}, {'answer': 'tea time', 'answer_confidence': 'yes', 'answer_id': 3}, {'answer': 'brunch', 'answer_confidence': 'yes', 'answer_id': 4}, {'answer': 'breakfast', 'answer_confidence': 'maybe', 'answer_id': 5}, {'answer': 'tea', 'answer_confidence': 'yes', 'answer_id': 6}, {'answer': 'teatime', 'answer_confidence': 'yes', 'answer_id': 7}, {'answer': 'lunch', 'answer_confidence': 'yes', 'answer_id': 8}, {'answer': 'reception', 'answer_confidence': 'maybe', 'answer_id': 9}, {'answer': 'breakfast', 'answer_confidence': 'yes', 'answer_id': 10}], 'image_id': 228478, 'answer_type': 'other', 'question_id': 228478002, 'split': 'train'}

#2:  

#### Questions Small Dataset

In [13]:
print("# Datapoints: ", len(qdata_small['questions']))
print("\nDatapoint keys: ", qdata_small['questions'][0].keys())
print("\n#1: ", qdata_small['questions'][0])
print("\n#2: ", qdata_small['questions'][1])
print("\n#3: ", qdata_small['questions'][2])

# Datapoints:  60000

Datapoint keys:  dict_keys(['image_id', 'question', 'question_id', 'split'])

#1:  {'image_id': 228478, 'question': 'What English meal is this likely for?', 'question_id': 228478002, 'split': 'train'}

#2:  {'image_id': 540769, 'question': 'Is there a bell on the train?', 'question_id': 540769000, 'split': 'test'}

#3:  {'image_id': 111756, 'question': 'What color is his uniform?', 'question_id': 111756005, 'split': 'train'}


### Dataset Statistics Small Dataset

In [14]:
question_types = set()
multiple_choice_answers = set()
answer2count = defaultdict(int)
answer_types = set()
answertypes2count = defaultdict(int)
top_answers_per_type = defaultdict(lambda: defaultdict(int))
for ann in adata_small['annotations']:
    question_types.add(ann['question_type'])
    
    multiple_choice_answers.add(ann['multiple_choice_answer'])
    
    answer2count[ann['multiple_choice_answer']] += 1
    answer_types.add(ann['answer_type'])
    
    answertypes2count[ann['answer_type']] += 1
    top_answers_per_type[ann['answer_type']][ann['multiple_choice_answer']] += 1

#### Question Types Small Dataset

In [15]:
print("# Unique Question Types: ", len(question_types))
print(question_types)

# Unique Question Types:  65
{'is that a', 'is this', 'does this', 'what color is', 'is', 'is the person', 'what is the name', 'what sport is', 'what does the', 'what animal is', 'how many people are in', 'what number is', 'what is', 'is this an', 'what is in the', 'what color is the', 'was', 'what is the man', 'are these', 'what room is', 'is it', 'is there a', 'what', 'is the man', 'is he', 'what is on the', 'can you', 'what kind of', 'is this a', 'none of the above', 'why is the', 'what brand', 'where is the', 'are there any', 'which', 'what is the color of the', 'what color are the', 'who is', 'what is the woman', 'do', 'what type of', 'are the', 'what are the', 'what color', 'is this person', 'what time', 'are there', 'are', 'what is the person', 'what is this', 'how many people are', 'is the', 'where are the', 'could', 'has', 'how many', 'why', 'is there', 'what are', 'are they', 'do you', 'does the', 'is the woman', 'how', 'what is the'}


#### Answer Types Small Dataset

In [16]:
print("Answer Types: ", answer_types)
print("Answer Type Counts: ", Counter(answertypes2count).most_common())
for t in list(answer_types):
    print("\nType '%s' Top 50 Answers %s" %(t, Counter(top_answers_per_type[t]).most_common(50)))

Answer Types:  {'other', 'number', 'yes/no'}
Answer Type Counts:  [('other', 20000), ('yes/no', 20000), ('number', 20000)]

Type 'other' Top 50 Answers [('white', 823), ('red', 494), ('black', 460), ('blue', 449), ('green', 355), ('brown', 331), ('yellow', 266), ('gray', 190), ('right', 154), ('frisbee', 152), ('nothing', 151), ('left', 144), ('baseball', 134), ('none', 132), ('orange', 130), ('wood', 127), ('tennis', 123), ('pink', 119), ('pizza', 118), ('kitchen', 113), ('bathroom', 106), ('cat', 90), ('water', 86), ('dog', 85), ('skiing', 84), ('grass', 84), ('surfing', 80), ('skateboarding', 78), ('horse', 75), ('black and white', 74), ('kite', 73), ('surfboard', 72), ('silver', 71), ('man', 69), ('living room', 66), ('woman', 65), ('giraffe', 64), ('table', 63), ('wii', 61), ('apple', 58), ('snow', 58), ('phone', 57), ('skateboard', 56), ('hat', 56), ('broccoli', 54), ('snowboarding', 53), ('eating', 53), ('cow', 52), ('standing', 51), ('sunny', 50)]

Type 'number' Top 50 Answers 

#### Answers Small Dataset

In [17]:
print("# Unique Answers: ", len(multiple_choice_answers))
print("\nSome Answers: ", list(np.random.choice(list(multiple_choice_answers), 100)))
print("\nTop 100 Common Answers: ", Counter(answer2count).most_common(100))

# Unique Answers:  5691

Some Answers:  ['38', 'lift', '6 5 4 3', 'happy 50th birthday', 'cutting board', '8 ft', 'cook', 'fresh oil', 'bakery', 'stars and hearts', 'street cleaner', 'ahc 442', 'colorado', 'owner', 'surfing', 'fashion show', 'mile', 'champion', 'headband', 'portable', 'luggage room', 'green and white', '3:10', 'rackets', '10:00 am', 'ducati', 'mocking', 'cemetery', 'grapefruit', 'fire department', 'movement', '2 people', 'hippie drum circle', 'fresh fruit', '7502', 'kite', 'relaxed', 'monday', "o'neill", 'on counter', '100% fatto mano', '365', 'cigar', 'brother', 'bob', '2:28', 'shaggy', 'kitty litter', 'carrot cake', 'horseback riding', 'sandwich and chips', '11:58', 'tennis dress', 'back left', '2 towels', 'hungry', 'behind head', 'james bond', '055', 'crouching', 'one sweet ride', 'fist', 'rainbow', '95', '0870 400 4000', 'boardwalk', '258', '592', '1126', 'bucket in shower', 'overpass', 'old fashioned', 'forsythia', "1940's", 'fast', 'tim hortons', 'jollibee', 'top

## Saving

In [18]:
import gzip

### Splitting

In [19]:
qdata_small_splits = {
    'train': {'questions': list()},
    'valid': {'questions': list()},
    'test': {'questions': list()},
}

adata_small_splits = {
    'train': {'annotations': list()},
    'valid': {'annotations': list()},
    'test': {'annotations': list()},
}

for i in range(len(qdata_small['questions'])):
    
    split = qdata_small['questions'][i]['split']
    assert split == adata_small['annotations'][i]['split'], "Inconsistent Splits."
    assert adata_small['annotations'][i]['question_id'] == qdata_small['questions'][i]['question_id'], "Inconsistent IDs."
    
    qdata_small_splits[split]['questions'].append(qdata_small['questions'][i])
    adata_small_splits[split]['annotations'].append(adata_small['annotations'][i])
    
        
print("Training Set Size: %i" %(len(qdata_small_splits['train']['questions'])))
print("\nValidation Set Size: %i" %(len(qdata_small_splits['valid']['questions'])))
print("\nTest Set Size: %i" %(len(qdata_small_splits['test']['questions'])))

Training Set Size: 48061

Validation Set Size: 8977

Test Set Size: 2962
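
As a quick arithmetic check, the sizes printed above can be compared with the 80/15/5 target. This is a minimal sketch: the variable names are illustrative, and the observed numbers are copied from the output above.

```python
total = 60000
ratios = {'train': 0.80, 'valid': 0.15, 'test': 0.05}
observed = {'train': 48061, 'valid': 8977, 'test': 2962}  # sizes printed above

# Expected size of each split if the ratios were hit exactly.
expected = {split: round(total * ratio) for split, ratio in ratios.items()}
# Actual share of each split in the subset.
shares = {split: observed[split] / total for split in observed}

for split in ratios:
    print("%s: expected %d, observed %d (%.2f%%)"
          % (split, expected[split], observed[split], 100 * shares[split]))
```

The small deviations from 48,000 / 9,000 / 3,000 are what one would expect when every pair draws its split at random rather than being assigned by a fixed quota.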


### Write out the files

In [20]:
for split in ['train', 'valid', 'test']:
    
    with gzip.GzipFile('data/vqa_annotatons_' + split + '.gzip', 'w') as file:
        file.write(json.dumps(adata_small_splits[split]).encode('utf-8'))
        
    with gzip.GzipFile('data/vqa_questions_' + split + '.gzip', 'w') as file:
        file.write(json.dumps(qdata_small_splits[split]).encode('utf-8'))
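
The files written above are gzip-compressed JSON, so they can be read back with `gzip.open` in text mode. Below is a minimal round-trip sketch on a toy record in a temporary directory; the helper names `write_split` and `read_split` are assumptions, not part of the notebook.

```python
import gzip
import json
import os
import tempfile


def write_split(path, data):
    """Write a dict as gzip-compressed JSON, mirroring the cell above."""
    with gzip.GzipFile(path, 'w') as file:
        file.write(json.dumps(data).encode('utf-8'))


def read_split(path):
    """Read a gzip-compressed JSON file back into a dict."""
    with gzip.open(path, 'rt', encoding='utf-8') as file:
        return json.load(file)


# Toy record copied from the small question set shown earlier.
toy = {'questions': [{'image_id': 228478,
                      'question': 'What English meal is this likely for?',
                      'question_id': 228478002,
                      'split': 'train'}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'vqa_questions_train.gzip')
    write_split(path, toy)
    restored = read_split(path)

print(restored == toy)
```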

Get the list of all image IDs used in the small dataset and save it:

In [21]:
image_ids = set()
for q in qdata_small['questions']:
    image_ids.add(q['image_id'])

image_ids_json = {'image_ids': list(image_ids)}
with open('data/image_ids_vqa.json', 'w') as file:
    json.dump(image_ids_json, file)