# Split into train/dev/test

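1. Split COCO `train2017` into train and validation sets
    1. Split into **train** and **dev** sets using 50/50 from each of the disjoint category combinations (the keys of `strCatNms_to_imgIds`)
    2. Select `TRAIN_IMGS_PER_CLASS` images (128 in this run) at random from each of the categories of interest, plus the same number from images containing none of them, for `train1`
    3. Select `DEV_IMGS_PER_CLASS` images (128 in this run) at random from each of the categories of interest, plus the same number from images containing none of them, for `dev1`
2. Use COCO `val2017` as test set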

Save the image IDs of each split as JSON-encoded lists in `.txt` files under `../my_splits/`.

**Note:** we use the balanced `train1` and `dev1` sets to remove the class imbalance present in the **train** and **dev** sets, so that we can tackle one problem at a time.
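
The 50/50 split per category combination described in step 1.1 can be sketched on toy data as follows. This is only an illustration: the helper name `split_half_per_group` and the toy IDs are made up, and it uses the standard-library `random` module instead of `np.random`, so it will not reproduce the exact IDs produced by this notebook.

```python
import random

def split_half_per_group(groups, seed=42):
    """Split every group's IDs 50/50 and pool the halves.

    `groups` maps a group key (here: a category combination) to a list of IDs.
    """
    first, second = [], []
    for key in sorted(groups):
        ids = list(groups[key])            # copy so the caller's list is untouched
        random.Random(seed).shuffle(ids)   # fresh generator with the same seed for every group
        cutoff = len(ids) // 2             # odd-sized groups give the extra ID to the second half
        first.extend(ids[:cutoff])
        second.extend(ids[cutoff:])
    return first, second

# Toy data, not real COCO image IDs
toy_groups = {"bird": [1, 2, 3, 4, 5], "bird cat": [6, 7], "cat": [8, 9, 10]}
toy_train, toy_dev = split_half_per_group(toy_groups)
print(len(toy_train), len(toy_dev))
```

Splitting inside each combination, rather than over the whole image list, keeps rare combinations such as `bird cat dog` represented on both sides.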

## Imports

In [1]:
%matplotlib inline
from pycocotools.coco import COCO

import json
import numpy as np
import pandas as pd
import skimage.io as io
import matplotlib.pyplot as plt
import pylab
pylab.rcParams['figure.figsize'] = (8.0, 10.0)

In [2]:
%cd ./utils/
from config import CATEGORIES_OF_INTEREST, TRAIN_IMGS_PER_CLASS, DEV_IMGS_PER_CLASS
print(CATEGORIES_OF_INTEREST, TRAIN_IMGS_PER_CLASS, DEV_IMGS_PER_CLASS)
%cd ..

/Users/gbatch/Documents/projects/current/cocoapi/PythonAPI/utils
['bird', 'cat', 'dog', 'person'] 128 128
/Users/gbatch/Documents/projects/current/cocoapi/PythonAPI


## Load the `strCatNms_to_imgIds` mapping for `train2017` to do step 1.1

In [3]:
dataDir='..'
dataType='train2017'
annFile='{}/annotations/instances_{}.json'.format(dataDir, dataType)

In [4]:
# initialize COCO api for instance annotations
coco=COCO(annFile)

loading annotations into memory...
Done (t=7.20s)
creating index...
index created!


In [5]:
with open(f'../my_annotations/strCatNms_to_imgIds_{dataType}.json', 'r') as f:
    strCatNms_to_imgIds = json.load(f)
strCatNms_to_ImgIdsNum = {key: len(value) for (key, value) in strCatNms_to_imgIds.items()}
strCatNms_to_ImgIdsNum

{'': 46223,
 'bird': 2351,
 'bird cat': 63,
 'bird cat dog': 4,
 'bird cat dog person': 3,
 'bird cat person': 3,
 'bird dog': 24,
 'bird dog person': 36,
 'bird person': 753,
 'cat': 3199,
 'cat dog': 155,
 'cat dog person': 46,
 'cat person': 641,
 'dog': 2153,
 'dog person': 1964,
 'person': 60669}

In [6]:
train_ids = []
dev_ids = []


for key in strCatNms_to_imgIds:
    img_ids = strCatNms_to_imgIds[key].copy()
    n = len(img_ids)
    
    # shuffle, but set seed for reproducibility
    np.random.seed(42)
    np.random.shuffle(img_ids)
    
    cutoff = int(n*0.5)
    
    train_ids_in_this_cat = img_ids[:cutoff]
    dev_ids_in_this_cat = img_ids[cutoff:]
    
    print(key, len(train_ids_in_this_cat), len(dev_ids_in_this_cat))
    print('\t', train_ids_in_this_cat[:5], dev_ids_in_this_cat[:5])
    
    train_ids.extend(train_ids_in_this_cat)
    dev_ids.extend(dev_ids_in_this_cat)

 23111 23112
	 [125616, 249356, 137123, 15684, 149598] [358788, 479011, 152038, 35952, 329711]
bird 1175 1176
	 [351852, 575758, 64300, 412355, 97195] [267710, 372343, 538976, 134328, 470525]
bird cat 31 32
	 [293757, 569975, 318594, 379620, 236941] [478586, 264476, 467386, 573881, 22907]
bird cat dog 2 2
	 [456438, 99645] [87456, 108923]
bird cat dog person 1 2
	 [345434] [392035, 257909]
bird cat person 1 2
	 [244933] [321861, 173814]
bird dog 12 12
	 [374564, 367699, 451976, 373346, 105918] [109907, 164114, 389138, 207597, 335581]
bird dog person 18 18
	 [178431, 39081, 457442, 64233, 16957] [278303, 341892, 26713, 16775, 298360]
bird person 376 377
	 [135045, 519899, 248793, 303318, 449071] [316658, 490585, 94156, 234600, 357210]
cat 1599 1600
	 [235700, 207282, 386619, 342244, 39171] [536067, 182903, 493295, 440650, 100586]
cat dog 77 78
	 [173825, 143824, 316008, 70754, 117108] [333819, 526664, 544760, 506296, 342150]
cat dog person 23 23
	 [307423, 481212, 427965, 530811, 124122

In [7]:
len(train_ids), len(dev_ids)

(59138, 59149)

In [8]:
len(train_ids) + len(dev_ids)

118287

In [9]:
len(train_ids) / (len(train_ids) + len(dev_ids))

0.499953502920862
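
The three checks above confirm the sizes; a further check that the two lists do not share any ID can be sketched like this. The helper `check_partition` is hypothetical and is shown here on toy lists.

```python
def check_partition(part_a, part_b, expected_total=None):
    """Return basic facts about two ID lists that should partition a dataset."""
    set_a, set_b = set(part_a), set(part_b)
    report = {
        "overlap": len(set_a & set_b),              # IDs present in both lists
        "duplicates_a": len(part_a) - len(set_a),   # repeated IDs inside the first list
        "duplicates_b": len(part_b) - len(set_b),   # repeated IDs inside the second list
        "total": len(part_a) + len(part_b),
    }
    if expected_total is not None:
        report["total_matches"] = report["total"] == expected_total
    return report

# Toy lists, not real COCO image IDs
toy_report = check_partition([1, 2, 3], [4, 5, 6, 7], expected_total=7)
print(toy_report)
```

In this notebook the call would presumably be `check_partition(train_ids, dev_ids, expected_total=len(coco.getImgIds()))`, with an overlap of 0 as the desired result.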

## For each of the categories of interest choose `TRAIN_IMGS_PER_CLASS` images and put into `train1_ids`

In [10]:
TRAIN_IMGS_PER_CLASS

128

In [11]:
train1_ids = []

for catNm in CATEGORIES_OF_INTEREST:
    catIds = coco.getCatIds(catNms=catNm)
    imgIds = coco.getImgIds(catIds=catIds)
    train_ids_in_this_cat = set(imgIds).intersection(set(train_ids))
    print(catNm, catIds, len(imgIds), len(train_ids_in_this_cat))
    
    np.random.seed(42)
    # sample without replacement so the same image is not drawn twice for one class
    random_ids = np.random.choice(list(train_ids_in_this_cat), TRAIN_IMGS_PER_CLASS, replace=False)
    
    
    train1_ids.extend(random_ids)
    
print(len(train1_ids))


# add TRAIN_IMGS_PER_CLASS images from the rest (images with none of the categories of interest)
rest_ids_all = strCatNms_to_imgIds['']
train_ids_rest = set(rest_ids_all).intersection(set(train_ids))

np.random.seed(42)
random_ids_rest = np.random.choice(list(train_ids_rest), TRAIN_IMGS_PER_CLASS, replace=False)
print('rest', '[no specific id]', len(rest_ids_all), len(train_ids_rest))
train1_ids.extend(random_ids_rest)


#print(type(train1_ids[0]))
train1_ids = np.array(train1_ids).tolist() # make sure to use python native int - makes saving with json possible
#print(type(train1_ids[0]))

print(len(train1_ids))

bird [16] 3237 1616
cat [17] 4114 2054
dog [18] 4385 2191
person [1] 64115 32055
512
rest [no specific id] 46223 23111
640
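
`np.random.choice` samples with replacement unless `replace=False` is passed, and an image that contains several categories of interest can be drawn for more than one class, so the combined list is not guaranteed to hold 640 distinct images. A small sketch for counting repeats and for drawing distinct IDs follows. The helper names are made up, and it uses `np.random.default_rng` rather than `np.random.seed`, so it would select different IDs than the cell above.

```python
from collections import Counter

import numpy as np

def count_repeats(ids):
    """Number of entries that repeat an ID already present earlier in the list."""
    return sum(count - 1 for count in Counter(ids).values())

def sample_unique(pool, k, seed=42):
    """Draw k distinct IDs; sorting the pool makes the draw independent of set ordering."""
    rng = np.random.default_rng(seed)
    return rng.choice(sorted(pool), size=k, replace=False).tolist()

toy_pool = set(range(1000))              # stand-in for one class's image IDs
toy_sample = sample_unique(toy_pool, 128)
print(len(toy_sample), count_repeats(toy_sample))
```

Running `count_repeats(train1_ids)` after the cell above would show how many of the 640 entries are repeats.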


## For each of the categories of interest choose `DEV_IMGS_PER_CLASS` images and put into `dev1_ids`

In [12]:
CATEGORIES_OF_INTEREST

['bird', 'cat', 'dog', 'person']

In [13]:
DEV_IMGS_PER_CLASS

128

In [14]:
dev1_ids = []

for catNm in CATEGORIES_OF_INTEREST:
    catIds = coco.getCatIds(catNms=catNm)
    imgIds = coco.getImgIds(catIds=catIds)
    dev_ids_in_this_cat = set(imgIds).intersection(set(dev_ids))
    print(catNm, catIds, len(imgIds), len(dev_ids_in_this_cat))
    
    np.random.seed(42)
    # sample without replacement so the same image is not drawn twice for one class
    random_ids = np.random.choice(list(dev_ids_in_this_cat), DEV_IMGS_PER_CLASS, replace=False)
    
    dev1_ids.extend(random_ids)
    
print(len(dev1_ids))


# add DEV_IMGS_PER_CLASS images from the rest (images with none of the categories of interest)
rest_ids_all = strCatNms_to_imgIds['']
dev_ids_rest = set(rest_ids_all).intersection(set(dev_ids))

np.random.seed(42)
random_ids_rest = np.random.choice(list(dev_ids_rest), DEV_IMGS_PER_CLASS, replace=False)
print('rest', '[no specific id]', len(rest_ids_all), len(dev_ids_rest))
dev1_ids.extend(random_ids_rest)


#print(type(dev1_ids[0]))
dev1_ids = np.array(dev1_ids).tolist() # make sure to use python native int - makes saving with json possible
#print(type(dev1_ids[0]))

print(len(dev1_ids))

bird [16] 3237 1621
cat [17] 4114 2060
dog [18] 4385 2194
person [1] 64115 32060
512
rest [no specific id] 46223 23112
640
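
The `train1` and `dev1` cells differ only in the split they draw from and in the per-class count, so they could share one helper. The sketch below is an assumption about how that might look, not the code used to produce the files in `../my_splits/`. The name `sample_per_class` and the toy pools are hypothetical, and it uses `np.random.default_rng` instead of `np.random.seed`.

```python
import numpy as np

def sample_per_class(class_to_ids, split_ids, per_class, seed=42):
    """For each class, draw `per_class` distinct IDs from that class's images inside one split."""
    split = set(split_ids)
    picked = {}
    for name, ids in class_to_ids.items():
        pool = sorted(split.intersection(ids))    # sorted for a reproducible order
        rng = np.random.default_rng(seed)         # re-seeded per class, mirroring the cells above
        picked[name] = rng.choice(pool, size=per_class, replace=False).tolist()
    return picked

# Toy stand-ins: in the notebook the pools would come from coco.getImgIds(catIds=...)
toy_classes = {"bird": range(0, 40), "cat": range(30, 80), "rest": range(80, 120)}
toy_split = range(0, 120, 2)                      # pretend the even IDs form the dev split
toy_picked = sample_per_class(toy_classes, toy_split, per_class=5)
toy_dev1 = [i for ids in toy_picked.values() for i in ids]
print({name: len(ids) for name, ids in toy_picked.items()}, len(toy_dev1))
```

Keeping the result as a dict per class also makes it easy to inspect how many images each class contributed before flattening.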


In [15]:
%ls ../my_splits/

dev1_ids.txt    dev_ids.txt     test_ids.txt    train1_ids.txt  train_ids.txt


### Get ids from `val2017` and save them as `test_ids`

In [16]:
dataDir='..'
dataType='val2017'
annFile='{}/annotations/instances_{}.json'.format(dataDir, dataType)

In [17]:
# initialize COCO api for instance annotations
coco=COCO(annFile)

loading annotations into memory...
Done (t=0.23s)
creating index...
index created!


In [18]:
len(coco.getImgIds())

5000

In [19]:
test_ids = coco.getImgIds()

### Save `train`, `dev`, `train1`, `dev1`, and `test` ids

In [20]:
with open('../my_splits/train_ids.txt', 'w') as f:
    f.write(json.dumps(train_ids))
with open('../my_splits/dev_ids.txt', 'w') as f:
    f.write(json.dumps(dev_ids))
    
with open('../my_splits/train1_ids.txt', 'w') as f:
    f.write(json.dumps(train1_ids))
with open('../my_splits/dev1_ids.txt', 'w') as f:
    f.write(json.dumps(dev1_ids))
    
with open('../my_splits/test_ids.txt', 'w') as f:
    f.write(json.dumps(test_ids))

In [22]:
# Now read the files back into Python lists
with open('../my_splits/train_ids.txt', 'r') as f:
    train_ids = json.loads(f.read())
with open('../my_splits/dev_ids.txt', 'r') as f:
    dev_ids = json.loads(f.read())

with open('../my_splits/train1_ids.txt', 'r') as f:
    train1_ids = json.loads(f.read())
with open('../my_splits/dev1_ids.txt', 'r') as f:
    dev1_ids = json.loads(f.read())
    
with open('../my_splits/test_ids.txt', 'r') as f:
    test_ids = json.loads(f.read())
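
The save and load cells rely on the IDs being native Python ints, which is why `train1_ids` and `dev1_ids` were passed through `.tolist()` earlier. A self-contained sketch of the round trip, using a temporary directory and made-up IDs, shows both the failure with NumPy integers and the check that the loaded list equals what was saved. `save_ids`, `load_ids`, and the file name are hypothetical.

```python
import json
import os
import tempfile

import numpy as np

def save_ids(path, ids):
    """Write IDs as a JSON list, converting NumPy integers to native ints first."""
    with open(path, "w") as f:
        json.dump([int(i) for i in ids], f)

def load_ids(path):
    """Read a JSON list of IDs back into a Python list."""
    with open(path) as f:
        return json.load(f)

toy_ids = list(np.array([42, 7, 1001]))   # elements are NumPy integer scalars
try:
    json.dumps(toy_ids)                    # json cannot encode NumPy integer scalars
    numpy_ints_serializable = True
except TypeError:
    numpy_ints_serializable = False

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_ids.txt")   # hypothetical file name
    save_ids(toy_path, toy_ids)
    toy_loaded = load_ids(toy_path)

print(numpy_ints_serializable, toy_loaded)
```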

**Note:** all sets have their annotation files in `../my_annotations/`:
* train and development sets in files with `*_train2017*`
* test set in files with `*_val2017*`