# Building an HDF5 image dataset

In this notebook, we build separate HDF5 datasets for the training, validation, and test sets. We use the [Dogs vs Cats Kaggle dataset](https://www.kaggle.com/c/dogs-vs-cats/data), which contains $25000$ training images and $12500$ test images. We also compute the mean of each color channel (RGB) over the training images. A configuration file sets the input and output paths, the number of classes, and the number of images assigned to the validation and test sets.

**Note**: In this example, we use only the training set of the Dogs vs Cats dataset; the validation and test sets are carved out of it.
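
The configuration module itself is not shown in this notebook. A minimal sketch of what `config/dogs_vs_cats_config.py` could contain is given below. The attribute names `IMAGES_PATH`, `NUM_TEST_IMAGES`, `NUM_VAL_IMAGES`, `TRAIN_HDF5`, `VAL_HDF5`, `TEST_HDF5` and `DATASET_MEAN` are the ones used in the cells that follow; `BASE_PATH`, `NUM_CLASSES`, and every path value are assumptions made for illustration.

```python
# Hypothetical contents of config/dogs_vs_cats_config.py.
import os

# Assumed root folder of the dataset; replace it with your own location.
BASE_PATH = os.path.join("datasets", "dogs_vs_cats")

# Folder holding the 25000 Kaggle training images.
IMAGES_PATH = os.path.join(BASE_PATH, "train")

# Two classes (cat, dog); the name NUM_CLASSES is an assumption.
NUM_CLASSES = 2

# Images held out for the validation and test splits (2500 each here).
NUM_VAL_IMAGES = 1250 * NUM_CLASSES
NUM_TEST_IMAGES = 1250 * NUM_CLASSES

# Output HDF5 files, one per split.
TRAIN_HDF5 = os.path.join(BASE_PATH, "hdf5", "train.hdf5")
VAL_HDF5 = os.path.join(BASE_PATH, "hdf5", "val.hdf5")
TEST_HDF5 = os.path.join(BASE_PATH, "hdf5", "test.hdf5")

# JSON file that stores the mean of each color channel.
DATASET_MEAN = os.path.join(BASE_PATH, "output", "dogs_vs_cats_mean.json")
```

Keeping these values in one module means the split sizes and output locations can be changed without editing the notebook cells.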

## Importing Libraries

In [1]:
from config import dogs_vs_cats_config as config
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
#compvis module
from compvis.preprocessing import ResizeAR
from compvis.io import HDF5DatasetWriter
from imutils import paths
import numpy as np
import json
import cv2
import os

## Creating the list of images and their labels

In [2]:
trainPaths = list(paths.list_images(config.IMAGES_PATH)) # list of paths
# the label ("cat" or "dog") is the part of the file name before the first dot
trainLabels = [p.split(os.path.sep)[-1].split(".")[0] for p in trainPaths] # list of labels
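
To make the label extraction concrete, the sketch below applies the same two `split` calls to a single path. The path is hypothetical; it only assumes the Kaggle naming scheme `<label>.<index>.jpg` for the training files.

```python
import os

# Hypothetical path following the Kaggle naming scheme <label>.<index>.jpg.
example_path = os.path.join("datasets", "dogs_vs_cats", "train", "dog.8011.jpg")

# Step 1: keep only the file name (last component of the path).
filename = example_path.split(os.path.sep)[-1]

# Step 2: keep the text before the first dot, which is the class name.
label = filename.split(".")[0]

print(filename, label)
```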

In [3]:
len(trainPaths)

25000

## Encoding the labels

In [4]:
# Encode the string labels as integers
le = LabelEncoder()
trainLabels = le.fit_transform(trainLabels)
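
`LabelEncoder` assigns integers to the sorted unique class names, so here `cat` becomes 0 and `dog` becomes 1. The pure-Python sketch below reproduces that mapping on a small hypothetical list; it is an illustration of the idea, not scikit-learn's implementation.

```python
# Hypothetical labels, in the order the files might be listed.
labels = ["dog", "cat", "cat", "dog"]

# Sorted unique class names, comparable to le.classes_ after fitting.
classes = sorted(set(labels))

# Map each class name to its position in the sorted list.
class_to_index = {name: i for i, name in enumerate(classes)}
encoded = [class_to_index[name] for name in labels]

print(classes, encoded)
```

With the fitted encoder from the cell above, `le.inverse_transform` maps the integers back to the class names.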

## Splitting the list of paths into train, validation and test sets

We split the dataset into training, validation, and test sets in two steps. In the first step, we hold out the test set of $2500$ images, which reduces `trainPaths` to $22500$ entries. In the second step, we hold out the validation set of $2500$ images, leaving $20000$ images in the final training set. Both splits are stratified, so each set keeps the same proportion of cats and dogs.
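
The sizes above can be checked with a small stand-alone sketch. The helper `stratified_holdout` is a hypothetical, simplified stand-in for `train_test_split(..., stratify=labels)`, written with the standard library only and run on synthetic labels (12500 per class) rather than on the real image paths.

```python
import random
from collections import defaultdict


def stratified_holdout(items, labels, holdout_size, seed=42):
    """Hold out `holdout_size` items while keeping the class proportions.

    Simplified sketch: it assumes the per-class share of the hold-out
    set is a whole number, which is true for the sizes used here.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    held = set()
    for indices in by_class.values():
        # Each class contributes in proportion to its frequency.
        n = holdout_size * len(indices) // len(labels)
        held.update(rng.sample(indices, n))

    keep = [i for i in range(len(items)) if i not in held]
    hold = sorted(held)
    return ([items[i] for i in keep], [items[i] for i in hold],
            [labels[i] for i in keep], [labels[i] for i in hold])


# Synthetic stand-in for the 25000 paths: 12500 cats (0), 12500 dogs (1).
all_items = list(range(25000))
all_labels = [0] * 12500 + [1] * 12500

# Step 1: hold out the test set; step 2: hold out the validation set.
train, test, train_y, test_y = stratified_holdout(all_items, all_labels, 2500)
train, val, train_y, val_y = stratified_holdout(train, train_y, 2500)

print(len(train), len(val), len(test))
```

Each held-out set contains 1250 items of each class, which is what stratification is meant to guarantee.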

In [5]:
# perform stratified sampling to build the testing split
split = train_test_split(trainPaths, trainLabels, test_size = config.NUM_TEST_IMAGES,
                         stratify = trainLabels, random_state = 42)
(trainPaths, testPaths, trainLabels, testLabels) = split


In [6]:
len(trainPaths)

22500

In [7]:
# perform another stratified sampling, this time to build the
# validation data
split = train_test_split(trainPaths, trainLabels, test_size = config.NUM_VAL_IMAGES,
                         stratify = trainLabels, random_state = 42)
(trainPaths, valPaths, trainLabels, valLabels) = split

In [8]:
len(trainPaths)

20000

## Building the train, validation and test datasets

In [9]:
# construct a list pairing the training, validation, and testing
# image paths along with their corresponding labels and output HDF5
# files
datasets = [("train", trainPaths, trainLabels, config.TRAIN_HDF5),
            ("val", valPaths, valLabels, config.VAL_HDF5),
            ("test", testPaths, testLabels, config.TEST_HDF5)] # each entry is written by one HDF5DatasetWriter
# (dataset name, image paths, labels, output path of the HDF5 file)

In [10]:
# Define the image preprocessor and the lists of per-image channel means (R, G, B)
aap = ResizeAR(256, 256)
(R, G, B) = ([], [], [])

In [11]:
for (dType, imagePaths, labels, outputPath) in datasets:
    # Create the HDF5 file (imagePaths avoids shadowing the imutils paths module)
    print("[INFO] building {}...".format(outputPath))
    # the writer takes the dataset shape (number of images, height, width, channels) and the output path
    writer = HDF5DatasetWriter((len(imagePaths), 256, 256, 3), outputPath)
    # Loop over the image paths and their labels
    for (i, (path, label)) in enumerate(zip(imagePaths, labels)):
        image = cv2.imread(path)
        image = aap.preprocess(image)
        # if we are building the training dataset, then compute the
        # mean of each channel in the image, then update the
        # respective lists (OpenCV stores the channels in BGR order)
        if dType == "train":
            (b, g, r) = cv2.mean(image)[:3]
            R.append(r)
            G.append(g)
            B.append(b)
        # add() writes the image and its label to the dataset
        writer.add([image], [label])
    # close the HDF5 file once all images of this split are written
    writer.close()

[INFO] building /home/igor/Documents/Artificial_Inteligence/Datasets/Cats and dogs /hdf5/train.hdf5...
[INFO] building /home/igor/Documents/Artificial_Inteligence/Datasets/Cats and dogs /hdf5/val.hdf5...
[INFO] building /home/igor/Documents/Artificial_Inteligence/Datasets/Cats and dogs /hdf5/test.hdf5...
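
The `HDF5DatasetWriter` class comes from the `compvis` package and its code is not shown here. As a rough illustration of what such a writer can do, the sketch below buffers images in memory and flushes them to an HDF5 file with `h5py`. The class name `SimpleHDF5Writer`, the dataset names `images` and `labels`, the `uint8` storage type, and the buffer size are all assumptions, not the `compvis` implementation.

```python
import os
import tempfile

import h5py
import numpy as np


class SimpleHDF5Writer:
    """Hypothetical buffered writer; not the compvis implementation."""

    def __init__(self, dims, output_path, buf_size=1000):
        self.db = h5py.File(output_path, "w")
        # One dataset for the images, one for the integer labels.
        self.images = self.db.create_dataset("images", dims, dtype="uint8")
        self.labels = self.db.create_dataset("labels", (dims[0],), dtype="int64")
        self.buf_size = buf_size
        self.buffer = {"images": [], "labels": []}
        self.idx = 0  # next free row in the datasets

    def add(self, rows, labels):
        # Collect rows in memory and write them in batches.
        self.buffer["images"].extend(rows)
        self.buffer["labels"].extend(labels)
        if len(self.buffer["images"]) >= self.buf_size:
            self.flush()

    def flush(self):
        # Copy the buffered rows to disk, then reset the buffer.
        end = self.idx + len(self.buffer["images"])
        self.images[self.idx:end] = np.array(self.buffer["images"])
        self.labels[self.idx:end] = np.array(self.buffer["labels"])
        self.idx = end
        self.buffer = {"images": [], "labels": []}

    def close(self):
        # Write any remaining rows before closing the file.
        if self.buffer["images"]:
            self.flush()
        self.db.close()


# Usage on four tiny synthetic "images" written to a temporary file.
output_path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
writer = SimpleHDF5Writer((4, 8, 8, 3), output_path, buf_size=2)
for label in [0, 1, 0, 1]:
    image = np.full((8, 8, 3), label, dtype="uint8")
    writer.add([image], [label])
writer.close()

# Read the file back to confirm what was stored.
with h5py.File(output_path, "r") as db:
    stored_shape = db["images"].shape
    stored_labels = db["labels"][:].tolist()

print(stored_shape, stored_labels)
```

Buffering keeps the number of disk writes small, which matters when tens of thousands of images are added one at a time.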


## Saving the JSON file with the means of the RGB channels

In [12]:
# construct a dictionary of averages, then serialize the means to a
# JSON file
print("[INFO] serializing means...")
D = {"R" : np.mean(R), "G" : np.mean(G), "B" : np.mean(B)}
# the with block closes the file even if writing fails
with open(config.DATASET_MEAN, "w") as f:
    f.write(json.dumps(D))

[INFO] serializing means...
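
The notebook does not show how the saved means are used later. One common use is mean subtraction before training, and the sketch below assumes that intent. It averages hypothetical per-image means, round-trips them through a temporary JSON file, and subtracts them from a single pixel in OpenCV's BGR order. All numbers and the file name are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical per-image channel means, as the lists R, G and B would hold them.
R = [120.0, 130.0]
G = [110.0, 118.0]
B = [100.0, 104.0]

# Average the per-image means to get one value per channel.
means = {"R": sum(R) / len(R), "G": sum(G) / len(G), "B": sum(B) / len(B)}

# Save the means to a temporary JSON file, then load them back.
mean_path = os.path.join(tempfile.mkdtemp(), "mean.json")
with open(mean_path, "w") as f:
    json.dump(means, f)

with open(mean_path) as f:
    loaded = json.load(f)

# OpenCV stores pixels in BGR order, so subtract in that order.
pixel_bgr = (150.0, 150.0, 150.0)
centered = (pixel_bgr[0] - loaded["B"],
            pixel_bgr[1] - loaded["G"],
            pixel_bgr[2] - loaded["R"])

print(loaded, centered)
```

Because the writer above stores every image with the same 256 by 256 shape, the average of the per-image means equals the mean over all training pixels.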
