## Candidate Generator

The query of interest is: **The highlighted boxes depict a person biking on a street** 

Each candidate is a `(BBox, BBox)` tuple: the first box is the person and the second is the bike. Each _candidate hash_ is a string of the format "[set_name]:[image_idx]:[person_idx]:[bike_idx]" that uniquely identifies a candidate. Currently, there is no way to go from a `(BBox, BBox)` tuple back to its candidate hash, though this should not be hard to implement.
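To make the hash format concrete, here is a minimal sketch of building and parsing such a string. The helper names `make_candidate_hash` and `parse_candidate_hash` are hypothetical and are not part of Snorkel or this notebook; the sketch also assumes that `set_name` never contains a `:` character.

```python
def make_candidate_hash(set_name, image_idx, person_idx, bike_idx):
    """Build "[set_name]:[image_idx]:[person_idx]:[bike_idx]" (hypothetical helper)."""
    return '{}:{}:{}:{}'.format(set_name, image_idx, person_idx, bike_idx)


def parse_candidate_hash(candidate_hash):
    """Split a candidate hash back into its four parts (hypothetical helper).

    Assumes set_name contains no ':' so the string has exactly four fields.
    """
    set_name, image_idx, person_idx, bike_idx = candidate_hash.split(':')
    return set_name, int(image_idx), int(person_idx), int(bike_idx)


example_hash = make_candidate_hash('val', 105, 5, 0)
print(example_hash)
print(parse_candidate_hash(example_hash))
```

Going from a `(BBox, BBox)` tuple back to a hash would additionally require the set name and the box indices to be recoverable from the tuple, which this sketch does not attempt.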

In [7]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

from snorkel import SnorkelSession
session = SnorkelSession()

In [21]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
%matplotlib inline

import pandas as pd
from snorkel.contrib.babble.image import BBox

## Load Annotations
Load the train and validation annotations for this task.

In [9]:
import os

# anns_folder = '/dfs/scratch0/paroma/coco/annotations/'
anns_folder = os.environ['SNORKELHOME'] + '/experiments/babble/bike/data/'
train_path = anns_folder + 'train_anns.npy'
val_path = anns_folder + 'val_anns.npy'

train_anns = np.load(train_path).tolist()
val_anns = np.load(val_path).tolist()

## New Image Parser

In [10]:
from snorkel.models import candidate_subclass

Biker = candidate_subclass('Biker', ['person', 'bike'])

In [29]:
from snorkel.parser import ImageCorpusExtractor, CocoPreprocessor

coco_preprocessor = CocoPreprocessor(train_path, source=0)
corpus_extractor = ImageCorpusExtractor(candidate_class=Biker)
%time corpus_extractor.apply(coco_preprocessor)
coco_preprocessor = CocoPreprocessor(val_path, source=1)
%time corpus_extractor.apply(coco_preprocessor, clear=False)

Clearing existing...
Running UDF...
CPU times: user 1.06 s, sys: 24 ms, total: 1.08 s
Wall time: 1.08 s
Running UDF...
CPU times: user 471 ms, sys: 23.9 ms, total: 495 ms
Wall time: 484 ms


In [30]:
for split in [0, 1]:
    num_candidates = session.query(Biker).filter(Biker.split == split).count()
    print("Split {} candidates: {}".format(split, num_candidates))

Split 0 candidates: 2406
Split 1 candidates: 1037


## New Task Instructions
For each image, we now want to:
* Create separate lists of the box indices labeled as person and as bike
* Create a list of `(person, bike)` index tuples
* Iterate through the pairs, check which ones overlap, and save only those

Re-calculate the number of candidate pairs and the class balance.
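The three steps listed above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of `get_person_bike_boxes` or `get_valid_pairs` used later in this notebook: it assumes each annotation is a dict with a `category_id` field and a COCO-style `bbox` field `[x, y, width, height]`, and that person and bike use category ids 1 and 2. All function names and the toy data are hypothetical.

```python
PERSON_ID = 1  # assumed category id for "person"
BIKE_ID = 2    # assumed category id for "bike"


def split_person_bike_indices(image_anns):
    """Step 1: separate lists of box indices for persons and for bikes."""
    person_indices = [i for i, ann in enumerate(image_anns)
                      if ann['category_id'] == PERSON_ID]
    bike_indices = [i for i, ann in enumerate(image_anns)
                    if ann['category_id'] == BIKE_ID]
    return person_indices, bike_indices


def boxes_overlap(box_a, box_b):
    """True if two [x, y, width, height] boxes share a region of positive area."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def overlapping_pairs(image_anns):
    """Steps 2 and 3: form all (person, bike) pairs, keep only overlapping ones."""
    person_indices, bike_indices = split_person_bike_indices(image_anns)
    pairs = [(p, b) for p in person_indices for b in bike_indices]
    return [(p, b) for p, b in pairs
            if boxes_overlap(image_anns[p]['bbox'], image_anns[b]['bbox'])]


# Toy annotations for one image (made up for illustration only)
toy_anns = [
    {'category_id': PERSON_ID, 'bbox': [10, 10, 50, 100]},
    {'category_id': BIKE_ID, 'bbox': [30, 60, 80, 60]},    # overlaps the person
    {'category_id': BIKE_ID, 'bbox': [300, 300, 40, 40]},  # far from the person
]
print(overlapping_pairs(toy_anns))
```

A strict overlap test like this treats boxes that only touch along an edge as non-overlapping; a different notion of a valid pair (for example a minimum intersection area) would change which candidates are kept.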

### Generate Candidate Hash and Dictionaries
We build two dictionaries keyed by candidate hash: one for gold labels and one for `BBox` objects.

In [25]:
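def create_bbox_candidates(set_name, anns, img_idx, pidx, bidx):
    # Wrap the person and bike annotations of one image in BBox objects.
    # Note: set_name is currently unused.
    p_bbox = BBox(anns[pidx], img_idx)
    b_bbox = BBox(anns[bidx], img_idx)
    return (p_bbox, b_bbox)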

In [26]:
def create_candidate_dict(candidate_dict, set_name):
    if set_name == 'val':
        anns = val_anns
    elif set_name == 'train':
        anns = train_anns
    else:
        # Fail early instead of hitting a NameError on `anns` below
        raise ValueError("set_name must be 'val' or 'train', got %r" % (set_name,))

    for i, image_anns in enumerate(anns):
        # Find all valid person-bike pairs for the given image
        person_indices, bike_indices = get_person_bike_boxes(image_anns)
        person_bike_tuples = [(x, y) for x in person_indices for y in bike_indices]
        valid_pairs = get_valid_pairs(image_anns, person_bike_tuples)

        # Generate a candidate for each valid pair
        for pidx, bidx in valid_pairs:
            candidate_hash = '%s:%d:%d:%d' % (set_name, i, pidx, bidx)
            candidate_dict[candidate_hash] = create_bbox_candidates(set_name, image_anns, i, pidx, bidx)

    return candidate_dict

In [27]:
candidate_dict = create_candidate_dict({}, 'val')
print('Candidates in Validation Set:  {}'.format(len(candidate_dict)))
candidate_dict = create_candidate_dict(candidate_dict, 'train')
print('Total Candidates:  {}'.format(len(candidate_dict)))

Candidates in Validation Set:  1037
Total Candidates:  3443


The `labels_by_candidate` dictionary only has keys for candidates with a label. If a candidate has no label, it is either in the train set or a bad validation candidate.

The `labels_by_candidate.npy` is created in `tutorials/babble/image/MTurk_io.ipynb` by parsing the MTurk CSV.
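Since only labeled candidates have keys, it helps to look labels up without indexing directly: the loaded object prints as a `defaultdict(list)`, and indexing a missing key on a `defaultdict` silently inserts an empty list. The sketch below uses a toy stand-in for `labels_by_candidate`; the helper names `lookup_label` and `class_balance` and the `'train:0:0:0'` key are hypothetical.

```python
from collections import defaultdict

# Toy stand-in for labels_by_candidate (two entries only, for illustration)
toy_labels = defaultdict(list, {'val:0:16:0': False, 'val:105:5:0': True})


def lookup_label(labels, candidate_hash):
    """Return True/False for a labeled candidate, or None if it has no label.

    dict.get does not trigger the defaultdict factory, so `labels` is not mutated.
    """
    return labels.get(candidate_hash)


def class_balance(labels):
    """Fraction of labeled candidates whose label is True (None if there are none)."""
    values = [v for v in labels.values() if isinstance(v, bool)]
    if not values:
        return None
    return sum(values) / float(len(values))


print(lookup_label(toy_labels, 'val:105:5:0'))
print(lookup_label(toy_labels, 'train:0:0:0'))  # hypothetical unlabeled candidate
print(class_balance(toy_labels))
```

Using `.get` (or an explicit `in` check) keeps the set of keys equal to the set of labeled candidates, which matters if the class balance is later computed from the same dictionary.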

In [31]:
labels_by_candidate = np.load(anns_folder+'labels_by_candidate.npy').tolist()

In [32]:
labels_by_candidate

defaultdict(list,
            {'val:0:16:0': False,
             'val:0:16:1': False,
             'val:0:16:2': False,
             'val:0:3:1': False,
             'val:102:1:0': False,
             'val:103:11:1': False,
             'val:103:2:1': False,
             'val:103:2:10': False,
             'val:103:2:12': False,
             'val:103:2:14': False,
             'val:103:3:0': False,
             'val:103:4:0': False,
             'val:103:4:13': False,
             'val:103:5:14': False,
             'val:103:6:13': False,
             'val:103:7:0': False,
             'val:103:7:13': False,
             'val:103:8:1': False,
             'val:103:8:10': False,
             'val:103:9:14': False,
             'val:105:14:0': False,
             'val:105:4:0': False,
             'val:105:5:0': True,
             'val:106:0:11': True,
             'val:106:5:12': True,
             'val:106:6:14': True,
             'val:106:6:3': True,
             'val:107:1:0': True,