### Preliminary Data Exploration
Author: catwong@ 12/27/2018

Env: Python 2 (no virtualenv)
Datasets:
- Regex, Learning with Latent Language (Andreas et al.) [https://github.com/jacobandreas/l3/tree/master/data]
- Spatial Navigation (Janner et al.) [https://github.com/JannerM/spatial-reasoning]
- CLEVR-Humans (Johnson et al.) [https://cs.stanford.edu/people/jcjohns/iep/]

### Analyze Datasets

#### Utility Functions

In [1]:
from preliminary.exploration_utils import *
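
The helpers imported above are not shown in this notebook. As a rough sketch of what an n-gram frequency count in the spirit of ngram_dataset_freq might look like: the function names below, and the assumption that each example stores a flat list of tokens under the given key, are illustrative and not the actual implementation.

```python
from collections import Counter

def ngrams(tokens, n=1):
    """Return every contiguous n-gram in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_freq_sketch(dataset, key, n=1):
    """Count n-grams over dataset[i][key] (assumed: a flat token list)."""
    counts = Counter()
    for example in dataset:
        counts.update(ngrams(example[key], n))
    return counts

# Toy data, made up for illustration only.
toy = [
    {'tokenized': ['is', 'the', 'cube', 'red']},
    {'tokenized': ['is', 'the', 'ball', 'blue']},
]
unigrams = ngram_freq_sketch(toy, 'tokenized', n=1)
bigrams = ngram_freq_sketch(toy, 'tokenized', n=2)
print(unigrams[('the',)])
print(bigrams[('is', 'the')])
```

Keying the counter by tuples matches the shape of the outputs further down, where unigrams appear as one-element tuples such as (u'the',).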

In [4]:

        
# #train_hint = ngram_dataset_freq(l3_regex['train'], 'hint', n=1, verbose=True)
# local_sr_fdist = ngram_dataset_freq(local_sr['train'], 'hints_aug', verbose=True)
# clevr_fdist = ngram_dataset_freq(clevr_humans['train'], 'tokenized', verbose=False)

# _ = ngram_cross_dataset_freq([local_sr_fdist, clevr_fdist], verbose=True)

### Datasets

In [3]:
from data.dataset_loading import *

#### L3-Regex

l3_regex: a dict with keys {train, test, val}; each value is a list of task dicts with keys:
- examples: the input/output pairs for the task.
- hint: the original natural-language description of the task.
- hints_aug: templated, augmented descriptions.
- re: the regex.
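
The structure described above can be sanity-checked with a small helper. This is a sketch only: the toy record, its field values, and the value types (for example, whether hints_aug holds token lists) are assumptions made for illustration, and check_split is a hypothetical helper, not part of the loading code.

```python
# Documented keys of each task dict.
REQUIRED_KEYS = {'examples', 'hint', 'hints_aug', 're'}

# Toy stand-in for l3_regex; every value below is made up.
l3_regex_toy = {
    'train': [{
        'examples': [['cat', 'xat'], ['cow', 'xow']],              # made-up I/O pairs
        'hint': 'replace leading c with x',                        # made-up description
        'hints_aug': [['replace', 'leading', 'c', 'with', 'x']],   # assumed shape
        're': 'placeholder-regex',                                 # placeholder, not a real task
    }],
    'val': [],
    'test': [],
}

def check_split(tasks):
    """Return indices of tasks that are missing any documented key."""
    return [i for i, task in enumerate(tasks)
            if not REQUIRED_KEYS.issubset(task)]

for split in ('train', 'val', 'test'):
    tasks = l3_regex_toy[split]
    print("%s: %d tasks, %d malformed" % (split, len(tasks), len(check_split(tasks))))
```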

In [3]:
l3_regex=load_l3(verbose=True)

l3_regex: 
train: 3000 tasks
val: 500 tasks
test: 500 tasks


In [3]:
# Frequency Distributions

train_hint = ngram_dataset_freq(l3_regex['train'], 'hints_aug', verbose=True)
test_hint = ngram_dataset_freq(l3_regex['test'], 'hints_aug', verbose=True)

train_hint = ngram_dataset_freq(l3_regex['train'], 'hints_aug', n=2, verbose=True)
test_hint = ngram_dataset_freq(l3_regex['test'], 'hints_aug', n=2, verbose=True)

Basic summary: 
train: 3000 tasks
val: 500 tasks
test: 500 tasks


#### Spatial Reasoning - Janner Version

To load up to max_train train maps and max_val val maps with mode = [ local | global ] instructions and annotations = [ human | synthetic ] descriptions, run:

~~~~
>>> import data
>>> train_data, val_data = data.load(mode, annotations, max_train, max_val)
>>> layouts, objects, rewards, terminal, instructions, values, goals = train_data
~~~~
Local: 1566 train, 399 test
Global: 1071 train, 272 test
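
Because data.load returns each split as a positional 7-tuple, it can be easy to mix up fields. One option is to wrap the tuple in a namedtuple; the sketch below uses a stand-in tuple instead of calling data.load, and the SplitData name and the sample instruction are hypothetical.

```python
from collections import namedtuple

# Field order follows the unpacking shown above.
SplitData = namedtuple(
    'SplitData',
    ['layouts', 'objects', 'rewards', 'terminal',
     'instructions', 'values', 'goals'])

# Stand-in for the tuple returned by data.load(...); real fields would
# hold the loaded maps and annotations rather than empty lists.
fake_train_data = ([], [], [], [], ['go to the red square'], [], [])

train = SplitData(*fake_train_data)
print(len(train.instructions))
```

Fields can then be accessed by name (train.instructions) while the object still unpacks like the original tuple.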

In [4]:
local_sr, global_sr = load_sr(verbose=True)


<Data> Loading local train environments with human annotations
<Data> Found 1566 annotations

<Data> Loading local test environments with human annotations
<Data> Found 399 annotations

<Data> Loading global train environments with human annotations
<Data> Found 1071 annotations

<Data> Loading global test environments with human annotations
<Data> Found 272 annotations
Found 1566 train instructions.
Found 399 test instructions.
Found 1071 train instructions.
Found 272 test instructions.


In [None]:
print("LOCAL:")
_ = ngram_dataset_freq(local_sr['train'], 'hints_aug', verbose=True)
_ = ngram_dataset_freq(local_sr['test'], 'hints_aug', verbose=True)

_ = ngram_dataset_freq(local_sr['train'], 'hints_aug', n=2, verbose=True)
_ = ngram_dataset_freq(local_sr['test'], 'hints_aug', n=2, verbose=True)

print("\nGLOBAL:")
_ = ngram_dataset_freq(global_sr['train'], 'hints_aug', verbose=True)
_ = ngram_dataset_freq(global_sr['test'], 'hints_aug', verbose=True)

_ = ngram_dataset_freq(global_sr['train'], 'hints_aug', n=2, verbose=True)
_ = ngram_dataset_freq(global_sr['test'], 'hints_aug', n=2, verbose=True)


#### CLEVR-Humans

Note: the official preprocessing from the paper is documented at https://github.com/facebookresearch/clevr-iep/blob/master/TRAINING.md

Format: each JSON file has keys ['info', 'questions']; questions is a list of entries of the form:
```
{u'answer': u'yes', u'question': u'Is there a blue cylinder?', u'split': u'train', u'image_index': 1429, u'image_filename': u'CLEVR_train_001429.png'}
```

In [23]:
import json

def tokenize(s, delim=' ',
             add_start_token=True, add_end_token=True,
             punct_to_keep=(';', ','), punct_to_remove=('?', '.')):
    """Lowercase a question and split it into tokens. Adapted from Johnson et al.

    add_start_token and add_end_token are kept in the signature but ignored:
    no start/end markers are added.
    """
    s = s.lower()
    # Put a delimiter before kept punctuation so it becomes its own token.
    if punct_to_keep is not None:
        for p in punct_to_keep:
            s = s.replace(p, '%s%s' % (delim, p))
    # Drop unwanted punctuation entirely.
    if punct_to_remove is not None:
        for p in punct_to_remove:
            s = s.replace(p, '')
    return s.split(delim)

clevr_humans = {}
for split in ('train', 'test', 'val'):
    path = "./data/clevr_humans/CLEVR-Humans-%s.json" % split
    with open(path) as f:
        clevr_humans[split] = json.load(f)['questions']
    print("Found %d questions in %s" % (len(clevr_humans[split]), split))
    # Tokenize each question in place.
    for example in clevr_humans[split]:
        example['tokenized'] = tokenize(example['question'])

Found 17817 questions in train
Found 7145 questions in test
Found 7202 questions in val
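
To make the tokenization concrete, here is a small self-contained check of the behaviour described by the cell above. The function is re-declared in compact form so the snippet runs on its own; the second question is a made-up example.

```python
def tokenize(s, delim=' ', punct_to_keep=(';', ','), punct_to_remove=('?', '.')):
    """Compact re-statement of the tokenizer above (no start/end tokens)."""
    s = s.lower()
    for p in punct_to_keep:
        s = s.replace(p, '%s%s' % (delim, p))   # ',' becomes its own token
    for p in punct_to_remove:
        s = s.replace(p, '')                    # '?' and '.' are dropped
    return s.split(delim)

print(tokenize('Is there a blue cylinder?'))
# ['is', 'there', 'a', 'blue', 'cylinder']

print(tokenize('What color is the cube, and the ball?'))
# ['what', 'color', 'is', 'the', 'cube', ',', 'and', 'the', 'ball']
```

Kept punctuation surfaces as separate tokens, which is why (u',',) shows up as an n-gram in the frequency outputs.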


In [37]:
print("TRAIN")
_= ngram_dataset_freq(clevr_humans['train'], 'tokenized', verbose=True)
_= ngram_dataset_freq(clevr_humans['train'], 'tokenized', n=2, verbose=True)

TRAIN
Printing for ngram, n=1
Num descriptions: 17817
Description avg: 8, med: 8, min: 4, max: 35
Vocabulary size: 990
Ngrams with freq > 10: 293
Total ngram in corpus: 155133
50 most common: (not including letters): [((u'the',), 20305), ((u'is',), 9668), ((u'are',), 8744), ((u'what',), 7279), ((u'color',), 5413), ((u'of',), 5090), ((u'how',), 4956), ((u'many',), 4947), ((u'there',), 4497), ((u'objects',), 4180), ((u'object',), 3829), ((u'shape',), 2889), ((u'same',), 2564), ((u'cube',), 2249), ((u'in',), 1970), ((u'cylinder',), 1906), ((u'large',), 1874), ((u'shiny',), 1771), ((u'cubes',), 1680), ((u'small',), 1569), ((u'sphere',), 1563), ((u'that',), 1442), ((u'cylinders',), 1438), ((u'metallic',), 1422), ((u'to',), 1386), ((u'as',), 1278), ((u'red',), 1257), ((u'matte',), 1238), ((u'purple',), 1182), ((u'green',), 1173), ((u'material',), 1169), ((u'blue',), 1159), ((u'any',), 1140), ((u'spheres',), 1128), ((u'and',), 1100), ((u'all',), 1081), ((u'ball',), 1051), ((u'yellow',), 896),
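
The summary lines above (description count, length statistics, vocabulary size, frequent n-grams) could be computed along the following lines. This is a sketch under assumptions: summarize is a hypothetical helper, the median is taken as the upper-middle element, and the real utility may differ. Note that the reported avg of 8 is consistent with truncating 155133 / 17817 ≈ 8.71, as integer division under Python 2 would; that explanation is a guess.

```python
from collections import Counter

def summarize(token_lists, min_freq=10):
    """Length and vocabulary statistics for a list of token lists."""
    lengths = sorted(len(tokens) for tokens in token_lists)
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return {
        'num_descriptions': len(lengths),
        'avg_len': total / float(len(lengths)),      # float avoids Python 2 truncation
        'median_len': lengths[len(lengths) // 2],    # upper-middle for even counts
        'min_len': lengths[0],
        'max_len': lengths[-1],
        'vocab_size': len(counts),
        'frequent': sum(1 for c in counts.values() if c > min_freq),
        'total_tokens': total,
    }

# Toy data, made up for illustration only.
stats = summarize([['a', 'b'], ['a', 'b', 'c'], ['a']], min_freq=1)
print(stats)
```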

### Cross Domain Frequency Analyses

#### Spatial Reasoning (Janner) and CLEVR-Humans

In [60]:
print("Spatial Reasoning Local and CLEVR-Humans")
local_sr_fdist = ngram_dataset_freq(local_sr['train'], 'hints_aug', verbose=False)
clevr_fdist = ngram_dataset_freq(clevr_humans['train'], 'tokenized', verbose=False)
_ = ngram_cross_dataset_freq([local_sr_fdist, clevr_fdist], verbose=True)

local_sr_fdist = ngram_dataset_freq(local_sr['train'], 'hints_aug', n=2, verbose=False)
clevr_fdist = ngram_dataset_freq(clevr_humans['train'], 'tokenized', n=2, verbose=False)
_ = ngram_cross_dataset_freq([local_sr_fdist, clevr_fdist], verbose=True)

Spatial Reasoning Local and CLEVR-Humans
Cross dataset frequency for 2 datasets.
Original vocabulary sizes are [196, 990]
Combined vocabulary size is 1069; intersected vocab is: 117
Intersection ngrams with freq > 10: 89
50 most common: (not including letters): [((u'the',), 22085), ((u'is',), 9786), ((u'are',), 8745), ((u'of',), 5785), ((u'object',), 3830), ((u'same',), 2565), ((u'to',), 2070), ((u'in',), 1995), ((u'that',), 1555), ((u'and',), 1540), ((u'a',), 1437), ((u'red',), 1270), ((u'blue',), 1229), ((u'purple',), 1216), ((u'green',), 1207), ((u'left',), 1168), ((u'two',), 1065), ((u'right',), 1042), ((u'yellow',), 910), ((u'one',), 780), ((u',',), 639), ((u'brown',), 470), ((u'on',), 456), ((u'other',), 456), ((u'most',), 396), ((u'square',), 395), ((u'above',), 390), ((u'from',), 375), ((u'which',), 344), ((u'between',), 339), ((u'with',), 293), ((u'next',), 276), ((u'by',), 274), ((u'closest',), 269), ((u'circle',), 255), ((u'it',), 251), ((u'diamond',), 230), ((u'gold',), 223
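
The vocabulary sizes reported above satisfy inclusion-exclusion: 196 + 990 - 117 = 1069. The count for (u'the',) is also larger here (22085) than in CLEVR-Humans alone (20305), which suggests that frequencies of shared n-grams are summed across datasets. A minimal sketch of that computation follows; cross_freq_sketch is a hypothetical stand-in, not the actual ngram_cross_dataset_freq, and the toy counts are made up.

```python
from collections import Counter

def cross_freq_sketch(fdists):
    """Combined vocab, shared vocab, and summed counts over shared n-grams."""
    vocabs = [set(fd) for fd in fdists]
    combined = set.union(*vocabs)
    shared = set.intersection(*vocabs)
    summed = Counter()
    for fd in fdists:
        for ngram in shared:
            summed[ngram] += fd[ngram]   # assumption: counts are added across datasets
    return combined, shared, summed

# Toy frequency distributions, made up for illustration only.
a = Counter({('the',): 5, ('left',): 2, ('go',): 1})
b = Counter({('the',): 7, ('left',): 1, ('cube',): 4})

combined, shared, summed = cross_freq_sketch([a, b])
print(len(combined), len(shared))
print(summed.most_common())
```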

In [63]:
print("Spatial Reasoning Global and CLEVR-Humans")
global_sr_fdist = ngram_dataset_freq(global_sr['train'], 'hints_aug', verbose=False)
clevr_fdist = ngram_dataset_freq(clevr_humans['train'], 'tokenized', verbose=False)
_ = ngram_cross_dataset_freq([global_sr_fdist, clevr_fdist], verbose=True)

global_sr_fdist = ngram_dataset_freq(global_sr['train'], 'hints_aug', n=2, verbose=False)
clevr_fdist = ngram_dataset_freq(clevr_humans['train'], 'tokenized', n=2, verbose=False)
_ = ngram_cross_dataset_freq([global_sr_fdist, clevr_fdist], verbose=True)

Spatial Reasoning Global and CLEVR-Humans
Cross dataset frequency for 2 datasets.
Original vocabulary sizes are [191, 990]
Combined vocabulary size is 1079; intersected vocab is: 102
Intersection ngrams with freq > 10: 81
50 most common: (not including letters): [((u'the',), 21904), ((u'is',), 9719), ((u'of',), 5509), ((u'object',), 3833), ((u'same',), 2565), ((u'to',), 2187), ((u'in',), 1984), ((u'that',), 1478), ((u'a',), 1411), ((u'as',), 1279), ((u'red',), 1259), ((u'and',), 1123), ((u'all',), 1082), ((u'left',), 969), ((u'right',), 881), ((u'two',), 873), ((u'most',), 667), ((u',',), 614), ((u'or',), 597), ((u'items',), 591), ((u'one',), 532), ((u'on',), 460), ((u'other',), 458), ((u'go',), 426), ((u'square',), 408), ((u'which',), 344), ((u'farthest',), 324), ((u'only',), 311), ((u'from',), 293), ((u'between',), 266), ((u'closest',), 266), ((u'next',), 255), ((u'it',), 245), ((u'move',), 228), ((u'both',), 214), ((u'above',), 196), ((u'with',), 189), ((u'furthest',), 184), ((u'blo