# Prepare data files

This notebook arranges the data files into the directory structure needed to run the Keras model. It also splits the data into training and validation sets and creates a small sample set for quick testing.

Keras requires the images of each class to be in a separate directory, so each data set folder will contain a `dogs/` folder holding all images of the dog class and a `cats/` folder holding all images of the cat class.

In total, the following folders will be created:

    data/train/cats/
    data/train/dogs/
    data/valid/cats/
    data/valid/dogs/
    data/sample/train/cats/
    data/sample/train/dogs/
    data/sample/valid/cats/
    data/sample/valid/dogs/
    
The code assumes that the data has been downloaded from the Kaggle Dogs vs. Cats Redux competition and stored in a folder called `data/`. Only the training data is manipulated, and it should start off in `data/train/` as a flat set of image files whose names begin with `cat` or `dog`.
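
Before running the cells below, it can help to confirm that `data/train/` really is in this starting state. The following is only a minimal sketch: `count_class_files` and `check_initial_state` are hypothetical helper names that are not part of the notebook, and the sketch assumes the image files end in `.jpg`.

```python
import glob
import os


def count_class_files(folder, class_names=("cat", "dog")):
    """Count the .jpg files in `folder` whose names start with each class name."""
    counts = {}
    for class_name in class_names:
        pattern = os.path.join(folder, "{}*.jpg".format(class_name))
        # Only count regular files, so a directory that happens to match is ignored.
        counts[class_name] = sum(1 for p in glob.glob(pattern) if os.path.isfile(p))
    return counts


def check_initial_state(folder):
    """Return True if `folder` holds a non-empty, equal mix of cat and dog files."""
    counts = count_class_files(folder)
    return counts["cat"] > 0 and counts["cat"] == counts["dog"]
```

Calling `check_initial_state('data/train')` should return `True` on a freshly downloaded data set and `False` once the files have been moved into class subfolders.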

In [27]:
import os
import glob
import shutil
import math
import numpy as np

Create the `sample/` folder and initialize it as a smaller version of the current `data/` folder (i.e. an equal mix of cat and dog files in a subfolder called `train/`). This lets us reuse the code that prepares the main files on the sample files.

In [112]:
# WARNING: If you already have files in the data/sample/train folder, this cell will 
#          permanently delete those files

sample_folder = 'data/sample/train'
if os.path.exists(sample_folder):
    #Delete the files
    shutil.rmtree(sample_folder)    
os.makedirs(sample_folder)

sample_size = 100
sample_size = int(math.ceil(sample_size/2.)*2) #i.e. force sample_size to be even
# Use integer division: np.random.choice needs an integer size
files_cat = np.random.choice(glob.glob('data/train/cat*'),size=sample_size//2,replace=False)
files_dog = np.random.choice(glob.glob('data/train/dog*'),size=sample_size//2,replace=False)
for source_file in np.concatenate((files_cat,files_dog)):
    shutil.copy2(source_file,sample_folder)
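
The selection logic in the cell above can also be written as a small function, which makes it easier to repeat the same draw. This is a sketch only: `pick_balanced_sample` is a hypothetical name, and it swaps in the standard-library `random` module with an optional seed in place of the notebook's unseeded `np.random.choice`.

```python
import random


def pick_balanced_sample(cat_files, dog_files, sample_size, seed=None):
    """Pick an even-sized sample with the same number of cat and dog files."""
    # Round an odd sample_size up to the next even number, as the cell above does.
    per_class = (sample_size + 1) // 2
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    # random.sample draws without replacement, like replace=False in the notebook.
    return rng.sample(list(cat_files), per_class) + rng.sample(list(dog_files), per_class)
```

For example, `pick_balanced_sample(glob.glob('data/train/cat*'), glob.glob('data/train/dog*'), 100, seed=0)` would return 50 cat paths followed by 50 dog paths, and the same list on every call with that seed.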

In [113]:
def split_train_valid(path,train_size):
    '''
    path is a directory whose train/ subfolder contains cat and dog files in equal proportion
    train_size is the fraction of samples to be assigned as training samples. It is a float between 0 and 1
    '''
    path_train = os.path.join(path,'train')
    path_valid = os.path.join(path,'valid')
    if os.path.exists(path_valid):
        print("WARNING: There is already a folder named valid in this path. This function should only be run if the folder is in the correct initial state as specified at the beginning of the notebook.")
    else:
        os.makedirs(path_valid)
        m = len([name for name in os.listdir(path_train) if os.path.isfile(os.path.join(path_train, name))])
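        m_train = int(math.ceil(m*0.5*train_size)) # training files per class; int() because np.random.choice needs an integer size
        # The files of each class left in train/ after that become the validation set.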

        def move_files(target,class_name):
            '''
            target is 'train' or 'valid'
            class_name is 'cat' or 'dog'  (singular)
            '''
            if target == 'train':
                files = np.random.choice(glob.glob('{}/{}*.jpg'.format(path_train,class_name)),size=m_train,replace=False)
            elif target == 'valid':
                files = glob.glob('{}/{}*.jpg'.format(path_train,class_name))
            else:
                raise ValueError("target must be either 'train' or 'valid'")
            destination_folder = os.path.join(path,target,''.join([class_name,'s']))
            os.makedirs(destination_folder)
            for source_file in files:
                destination_file = os.path.join(destination_folder,os.path.basename(source_file))
                os.rename(source_file,destination_file)
        
        move_files('train','cat')
        move_files('train','dog')
        move_files('valid','cat')
        move_files('valid','dog')
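
To see how many files end up in each folder, the arithmetic in `split_train_valid` can be reproduced on its own. This is a rough sketch: `split_counts` is a hypothetical helper, and the figure of 25,000 files for the full training set is an assumption about the downloaded data.

```python
import math


def split_counts(m, train_size):
    """Return (train, valid) file counts per class for m files in total."""
    per_class = m // 2  # assumes an equal number of cat and dog files
    n_train = int(math.ceil(m * 0.5 * train_size))  # same formula as m_train
    n_valid = per_class - n_train  # whatever is left in train/ moves to valid/
    return n_train, n_valid


print(split_counts(25000, 0.92))  # (11500, 1000)
print(split_counts(100, 0.92))    # (46, 4)
```

With `train_size = 0.92`, a 100-file sample therefore keeps 46 training and 4 validation images per class.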


In [114]:
split_train_valid('data/sample/',0.92)



In [115]:
split_train_valid('data/',0.92)
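
Once both calls have run, a quick count per folder can confirm that the files landed where expected. This is a minimal sketch: `summarize_split` is a hypothetical helper that is not part of the notebook, and it simply reports whatever class subfolders it finds under `train/` and `valid/`.

```python
import os


def summarize_split(root):
    """Return {(split, class_folder): file_count} for train/ and valid/ under root."""
    summary = {}
    for split in ("train", "valid"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue  # this split has not been created yet
        for class_folder in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, class_folder)
            if os.path.isdir(class_dir):
                # Count regular files only, ignoring any nested directories.
                summary[(split, class_folder)] = sum(
                    1 for name in os.listdir(class_dir)
                    if os.path.isfile(os.path.join(class_dir, name))
                )
    return summary
```

Running `summarize_split('data/sample')` and `summarize_split('data')` should show the same number of files for both classes within each split.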

