# Step 1: Import Dataset
In the code cell below, we import a dataset of actress images. We use the load_files function from the scikit-learn library to work with the following variables:

1. train_files, valid_files, test_files - NumPy arrays containing file paths to images
2. train_targets, valid_targets, test_targets - NumPy arrays containing one-hot-encoded classification labels
3. actress_names - list of actress names (strings) for translating labels (the cell below does not define this one yet)

In [13]:
from sklearn.datasets import load_files
from keras.utils import to_categorical
import numpy as np

# define a function to load the train, test and validation datasets
def load_dataset(path):
    data = load_files(path)
    actress_files = np.array(data['filenames'])
    # one-hot encode the labels; we currently have 5 actresses to compare against
    actress_targets = to_categorical(np.array(data['target']), 5)
    return actress_files, actress_targets

# load the train, test and validation datasets
train_files, train_targets = load_dataset('../Celebs/train')
test_files, test_targets = load_dataset('../Celebs/test')
valid_files, valid_targets = load_dataset('../Celebs/valid')

print("Total Actress Images {}".format(len(np.hstack([train_files, test_files, valid_files]))))
print("Train Actress Images {}".format(len(train_files)))
print("Test Actress Images {}".format(len(test_files)))
print("Valid Actress Images {}".format(len(valid_files)))

Total Actress Images 4970
Train Actress Images 3978
Test Actress Images 496
Valid Actress Images 496
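
The relationship between integer labels, one-hot rows, and class names can be sketched in a few lines. This is only an illustration: it assumes the one-folder-per-class layout that load_files expects and that label indices follow the sorted order of the folder names. The helper names (class_names_from_folders, one_hot, decode) and the folder names in the demo are hypothetical and not part of the notebook.

```python
import os
import tempfile

import numpy as np


def class_names_from_folders(path):
    """Return the class sub-folder names in sorted order (assumed label order)."""
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )


def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows, similar to to_categorical."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]


def decode(one_hot_row, names):
    """Map a one-hot row back to its class name."""
    return names[int(np.argmax(one_hot_row))]


# Demo on a throwaway directory with made-up folder names.
with tempfile.TemporaryDirectory() as root:
    for folder in ("actress_b", "actress_a", "actress_c"):
        os.makedirs(os.path.join(root, folder))
    demo_names = class_names_from_folders(root)

demo_targets = one_hot([2, 0], num_classes=len(demo_names))
print(demo_names)
print(demo_targets)
print(decode(demo_targets[0], demo_names))
```

In the notebook, calling class_names_from_folders('../Celebs/train') would give a candidate for actress_names. The object returned by load_files also carries the folder names under the 'target_names' key, which avoids a second directory scan.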


# Pre-process the data
Here we use TensorFlow as the backend for Keras, which requires our images as a 4D array (a.k.a. a 4D tensor) with shape

$$
(\text{nb_samples}, \text{rows}, \text{columns}, \text{channels}),
$$

where nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels for each image, respectively.
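
To make the shape convention concrete, here is a small NumPy sketch on a dummy batch. The array is all zeros and the variable names are illustrative only.

```python
import numpy as np

# A dummy batch of 2 RGB images, 224 x 224 pixels each (all zeros).
batch = np.zeros((2, 224, 224, 3), dtype="float32")

# The four axes unpack in the order given in the formula above.
nb_samples, rows, columns, channels = batch.shape
print(nb_samples, rows, columns, channels)

# Indexing the first axis picks one image, which is a 3D tensor.
first_image = batch[0]
print(first_image.shape)

# Indexing the last axis picks one colour channel of that image
# (index 0 is red when the image was loaded as RGB).
first_channel = first_image[:, :, 0]
print(first_channel.shape)
```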

The path_to_tensor function takes a string specifying a file location and performs the following operations:

1. Loads the image and resizes it to (224, 224).
2. Converts the resized image to a 3D array with shape (224, 224, 3).
3. Expands the 3D array to a 4D array with shape (1, 224, 224, 3).

Another helper function, paths_to_tensor, takes an array of image file paths, calls path_to_tensor on each of them, and vertically stacks the outputs into a single 4D tensor with shape (nb_samples, 224, 224, 3).

Here, nb_samples is the number of samples, or number of images, in the supplied array of image paths. It is best to think of nb_samples as the number of 3D tensors (where each 3D tensor corresponds to a different image) in your dataset!
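
The expand-and-stack behaviour of the two helpers can be checked without any image files. The sketch below substitutes random arrays for decoded images, so fake_path_to_tensor is a stand-in written for this illustration, not the notebook's function.

```python
import numpy as np

rng = np.random.default_rng(0)


def fake_path_to_tensor():
    """Stand-in for path_to_tensor: returns one (1, 224, 224, 3) tensor."""
    # Pretend this array came from loading and resizing an image file.
    fake_image = rng.random((224, 224, 3))
    # Add a leading axis so the image becomes a batch of size 1.
    return np.expand_dims(fake_image, axis=0)


single = fake_path_to_tensor()
print(single.shape)

# Stacking three single-image tensors along the first axis
# gives one tensor whose first dimension is nb_samples.
stacked = np.vstack([fake_path_to_tensor() for _ in range(3)])
print(stacked.shape)
```

Because every per-image tensor already has a leading axis of length 1, np.vstack only has to concatenate along that axis, which is why nb_samples ends up equal to the number of paths supplied.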

In [18]:
from keras.preprocessing import image
from tqdm import tqdm_notebook as tqdm

def path_to_tensor(img_path):
    # loads RGB image as PIL.Image.Image type
    img = image.load_img(img_path, target_size=(224, 224))
    # convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, 224, 224, 3) and return 4D tensor
    return np.expand_dims(x, axis=0)
    

def paths_to_tensor(img_paths):
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]    
    return np.vstack(list_of_tensors)

In [19]:
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True   

# pre-process the data: convert to float32 and rescale pixel values from [0, 255] to [0, 1]
train_tensors = paths_to_tensor(train_files).astype('float32')/255
test_tensors = paths_to_tensor(test_files).astype('float32')/255
valid_tensors = paths_to_tensor(valid_files).astype('float32')/255

print(train_tensors.shape)

(3978, 224, 224, 3)
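
Dividing by 255 maps 8-bit pixel values into the range [0, 1]. A minimal check on a hand-made array, with arbitrary pixel values chosen for illustration:

```python
import numpy as np

# Three arbitrary 8-bit pixel values: black, mid grey, white.
pixels = np.array([0, 128, 255], dtype="uint8")

# Same operation as in the cell above: cast to float32, then rescale.
scaled = pixels.astype("float32") / 255
print(scaled.dtype)
print(scaled.min(), scaled.max())
```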


# TODO
1. Create a CNN from scratch with these tensors
2. Create augmentations, use them to create bottleneck features, and save the features to S3
3. Fetch the bottleneck features from S3 and use transfer learning to build a CNN