# Prep Dataset of Images for CSHL Course

## COCO-Search18
[COCO-Search18](https://saliency.tuebingen.ai/datasets/COCO-Search18/)


This block demos accessing the COCO-Search18 JSON files. Each entry in train and val has other keys besides 'name' (such as 'X', 'Y', and 'T') that you can use to get the fixation info.

In [9]:
import json

# Load the training-split fixation records
with open('./coco_search18/coco_search18_fixations_TP_train_split1.json', 'r') as file:
    train = json.load(file)
# Load the validation-split fixation records
with open('./coco_search18/coco_search18_fixations_TP_validation_split1.json', 'r') as file:
    val = json.load(file)

# 'train' is a list of dictionaries, one per JSON entry; collect the image names
train_split = []
for d in train:
    train_split.append(d['name'])

# 'val' has the same structure; collect its image names too
val_split = []
for d in val:
    val_split.append(d['name'])

print('Example JSON Entry:')
print(train[0])


Example JSON Entry:
{'name': '000000478726.jpg', 'subject': 2, 'task': 'bottle', 'condition': 'present', 'bbox': [1063, 68, 95, 334], 'X': [848.2, 799.2, 731.1, 1114.4, 1121.5], 'Y': [517.2, 476.2, 383.4, 271.1, 205.9], 'T': [73, 193, 95, 635, 592], 'length': 5, 'correct': 1, 'RT': 1159, 'split': 'train'}
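
As a rough sketch of reading the fixation info from an entry like the one above, the snippet below pairs up the 'X', 'Y', and 'T' lists into one tuple per fixation. It assumes (not confirmed here) that 'T' holds one timing value per fixation and that 'bbox' is [x, y, width, height] of the target in the same pixel coordinates as the fixations. The helper names `fixations` and `in_bbox` are made up for illustration.

```python
# Example entry copied from the output above (only the keys used here)
entry = {'name': '000000478726.jpg', 'task': 'bottle',
         'bbox': [1063, 68, 95, 334],
         'X': [848.2, 799.2, 731.1, 1114.4, 1121.5],
         'Y': [517.2, 476.2, 383.4, 271.1, 205.9],
         'T': [73, 193, 95, 635, 592], 'length': 5}

def fixations(entry):
    """Return one (x, y, t) tuple per fixation in the entry."""
    return list(zip(entry['X'], entry['Y'], entry['T']))

def in_bbox(x, y, bbox):
    """Assumes bbox = [x, y, width, height]; True if the point lies inside."""
    bx, by, bw, bh = bbox
    return bx <= x <= bx + bw and by <= y <= by + bh

scanpath = fixations(entry)
last_x, last_y, _ = scanpath[-1]
print(len(scanpath), 'fixations; last one on target:',
      in_bbox(last_x, last_y, entry['bbox']))
```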


There are duplicate names in our lists because there are multiple entries for individual images. Remove these.

In [11]:
print('Size Before:')
print('Train: ',len(train_split))
print('Val: ',len(val_split))

train_split = list(set(train_split))
val_split = list(set(val_split))

print('Size After:')
print('Train: ',len(train_split))
print('Val: ',len(val_split))

Size Before:
Train:  21622
Val:  3258
Size After:
Train:  1934
Val:  315
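
To see why the lists shrink so much, it can help to count how many entries each image has. Here is a minimal sketch using `collections.Counter` on a made-up list of names (the toy `names` list is an assumption, standing in for `train_split` before de-duplication). Sorting the unique names is optional, but it gives a stable order, which `list(set(...))` does not guarantee.

```python
from collections import Counter

# Toy stand-in for train_split before de-duplication (made-up names)
names = ['000000000009.jpg', '000000000025.jpg', '000000000009.jpg',
         '000000000009.jpg', '000000000025.jpg', '000000000030.jpg']

entries_per_image = Counter(names)        # image name -> number of entries
unique_names = sorted(entries_per_image)  # de-duplicated, in a stable order

print(entries_per_image.most_common(1))
print(unique_names)
```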


## COCO-Periph

COCO-Periph includes a subset of images for which all 4 eccentricities have been rendered.

I used this script to pick 100 of the training images that have all 4 COCO-Periph eccentricities. You can choose more (there are 990 in total); the full list is in ccp_search_subset_train.csv.

You can copy the rest of them to your local machine, along with the test and validation images, by downloading the full COCO-Periph dataset from [https://data.csail.mit.edu/coco_periph/](https://data.csail.mit.edu/coco_periph/) and changing the filepath. You may also need to change fname_80, fname_160, fname_240, and fname_320 to match the public dataset's formatting.

In [12]:
import os

def check_in_cocop(imgnum,filepath='./COCO_Periph'):
    '''
    Check whether an image has all 4 COCO-Periph eccentricities rendered

    Parameters:
        imgnum (str): the MS COCO Image number desired
        filepath (str): path to the COCO Periph Directory
    Returns:
        bool: True if the image file exists in all 4 eccentricity folders
    '''
    fname_80 = os.path.join(filepath,'ecc_80',f'{imgnum.zfill(12)}.jpg')
    fname_160 = os.path.join(filepath,'ecc_160',f'{imgnum.zfill(12)}.jpg')
    fname_240 = os.path.join(filepath,'ecc_240',f'{imgnum.zfill(12)}.jpg')
    fname_320 = os.path.join(filepath,'ecc_320',f'{imgnum.zfill(12)}.jpg')

    # The image counts as complete only if every eccentricity file is present
    return all([os.path.isfile(fname_80),
                os.path.isfile(fname_160),
                os.path.isfile(fname_240),
                os.path.isfile(fname_320)])

check_in_cocop('9',filepath='../coco_periph_data/train/')

True
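
Since the public COCO-Periph download may name its folders or files differently, one option is to make the folder names and the filename pattern arguments. This is a sketch only: `check_complete`, `ecc_dirs`, and `name_fmt` are hypothetical names, and the defaults simply mirror the layout used above rather than the public dataset's actual layout.

```python
import os

def check_complete(imgnum, filepath,
                   ecc_dirs=('ecc_80', 'ecc_160', 'ecc_240', 'ecc_320'),
                   name_fmt='{num}.jpg'):
    """Return True if the image exists in every folder listed in ecc_dirs.

    ecc_dirs and name_fmt are assumptions to adjust to your copy of the data.
    """
    # Zero-pad the image number to 12 digits, as in MS COCO filenames
    fname = name_fmt.format(num=str(imgnum).zfill(12))
    return all(os.path.isfile(os.path.join(filepath, d, fname)) for d in ecc_dirs)
```

With the defaults, `check_complete('9', '../coco_periph_data/train/')` should behave like the call above.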

In [16]:
complete_subset_train = []
for img in train_split:
    if check_in_cocop(img.replace('.jpg',''),filepath='../coco_periph_data/train/'):
        complete_subset_train.append(img)
        
print(len(complete_subset_train))

import csv
# Open the file in write mode
with open('./ccp_search_subset_train.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    
    # Write each string as a separate row in the CSV file
    for string in complete_subset_train:
        writer.writerow([string])

990
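
To reuse the full list later without re-scanning the folders, the CSV can be read back into a list. A minimal sketch, assuming the one-name-per-row layout written above; `load_image_list` is a hypothetical helper name.

```python
import csv

def load_image_list(csv_path):
    """Read a one-column CSV of image filenames back into a list of strings."""
    with open(csv_path, 'r', newline='') as file:
        # Skip empty rows so a trailing blank line does not break the load
        return [row[0] for row in csv.reader(file) if row]
```

For example, `load_image_list('./ccp_search_subset_train.csv')` would return the names written by the cell above.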


## Select a subset of these
I used this script to pick a randomly chosen subset of 100 and copy them from our internal server to create the ccp_search18_train_subset folder used for the course.

In [17]:
import random
rand_subset = random.sample(complete_subset_train, 100)
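
`random.sample` returns a different subset on every run. If you want to be able to regenerate the same subset, one option is to sort the list and sample with a seeded generator. This is a sketch, not what was used to build the course folder: `pick_subset` and the seed value are made up for illustration. Sorting matters because the order of `list(set(...))` can change between Python sessions.

```python
import random

def pick_subset(names, k, seed=0):
    """Draw k names reproducibly: sort first, then sample with a seeded RNG."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return rng.sample(sorted(names), k)
```

Calling `pick_subset(complete_subset_train, 100)` twice with the same seed should then give the same 100 names.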

In [18]:
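import os
import shutil

eccs = [5,10,15,20]  # eccentricities in degrees
ppd=16  # pixels per degree; ecc*ppd gives the source folders ecc_80 ... ecc_320

# shutil.copyfile does not create destination folders, so make them first
for ecc in [0] + eccs:
    os.makedirs(f'ccp_search18_train_subset/ecc_{ecc}', exist_ok=True)

for img in rand_subset:

    #zero ecc first (original MS coco image)
    src = f'/home/gridsan/groups/datasets/coco/train2017/{img}'
    dst = f'ccp_search18_train_subset/ecc_0/{img}'
    shutil.copyfile(src, dst)
    
    #loop through eccentricities
    for ecc in eccs:
        src = f'/home/gridsan/groups/RosenholtzLab/coco_periph_data/train/ecc_{ecc*ppd}/{img}'
        dst = f'ccp_search18_train_subset/ecc_{ecc}/{img}'
        shutil.copyfile(src, dst)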
        