# Data Prep

- [ ] Download CIFAR100 data from TF/Keras
- [ ] Write data to disk
- [ ] Normalize Images
- [ ] Randomly split images into train/val/test sets
- [ ] Write normalized train/val/test images to disk

In [1]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Download CIFAR100 Images and Labels

In [2]:
%%time
(X_train, y_train_coarse), (X_test, y_test_coarse) = tf.keras.datasets.cifar100.load_data(label_mode='coarse')
(_, y_train_fine), (_, y_test_fine) = tf.keras.datasets.cifar100.load_data(label_mode='fine')

Wall time: 3.05 s


# Create Directories for Data

These paths could instead be read from a YAML config file.

In [3]:
parent_path = '../data'
download_path = parent_path + '/downloads'
train_val_test_path = parent_path +'/train_val_test'
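The idea of config-driven paths can be sketched as below. This is only an illustration: `DataPrepConfig` and its field names are hypothetical, and the default values simply mirror the ones used in this notebook. The same fields could be populated from a YAML file with a parser such as PyYAML, which is not shown here.

```python
import os
from dataclasses import dataclass


@dataclass
class DataPrepConfig:
    # Hypothetical config object; the field names are assumptions for illustration
    parent_path: str = '../data'
    download_dir: str = 'downloads'
    train_val_test_dir: str = 'train_val_test'

    @property
    def download_path(self):
        # Raw downloaded arrays go here
        return os.path.join(self.parent_path, self.download_dir)

    @property
    def train_val_test_path(self):
        # Normalized train/val/test arrays go here
        return os.path.join(self.parent_path, self.train_val_test_dir)


config = DataPrepConfig()
print(config.download_path, config.train_val_test_path)
```

Using `os.path.join` rather than string concatenation avoids having to track leading slashes by hand.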

Create the directories above if they don't already exist.

In [4]:
# check if the parent directory exists
if os.path.exists(parent_path):
    print("Parent directory {} already exists".format(parent_path))
    # check if sub directories exist
    if os.path.exists(download_path):
        print("Download path {} already exists".format(download_path))
    else:
        os.makedirs(download_path)
        
    if os.path.exists(train_val_test_path):
        print("Train/Val/Test split path {} already exists".format(train_val_test_path))
    else:
        os.makedirs(train_val_test_path)
else:
    print("Parent directory doesn't exist")
    os.makedirs(parent_path)
    os.makedirs(download_path)
    os.makedirs(train_val_test_path)
    print("Created directories")

Parent directory ../data already exists
Download path ../data/downloads already exists
Train/Val/Test split path ../data/train_val_test already exists
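A more compact alternative to the nested existence checks is `os.makedirs` with `exist_ok=True`, which creates any missing parent directories and does nothing if the directory is already there. The sketch below is not the cell above; `ensure_dirs` is a hypothetical helper, and it runs in a temporary directory so it has no side effects on `../data`.

```python
import os
import tempfile


def ensure_dirs(*paths):
    """Create each directory (and any missing parents) if it is not already there."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


# Demonstrated in a temporary directory (assumption: stand-in for ../data)
with tempfile.TemporaryDirectory() as tmp:
    demo_parent = os.path.join(tmp, 'data')
    demo_download = os.path.join(demo_parent, 'downloads')
    demo_split = os.path.join(demo_parent, 'train_val_test')

    ensure_dirs(demo_download, demo_split)
    ensure_dirs(demo_download, demo_split)  # second call is a no-op, no error raised
    created = sorted(os.listdir(demo_parent))

print(created)
```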


# Write Downloaded Data to Disk

Images

In [5]:
with open(download_path+'/train.npy', 'wb') as f:
    np.save(f, X_train)
    
with open(download_path+'/test.npy', 'wb') as f:
    np.save(f, X_test)

Superclasses and classes

In [6]:
# superclasses
with open(download_path+'/train_superclasses.npy', 'wb') as f:
    np.save(f, y_train_coarse)
    
with open(download_path+'/test_superclasses.npy', 'wb') as f:
    np.save(f, y_test_coarse)
    
# classes
with open(download_path+'/train_classes.npy', 'wb') as f:
    np.save(f, y_train_fine)
    
with open(download_path+'/test_classes.npy', 'wb') as f:
    np.save(f, y_test_fine)

# Normalize Data

Each image has 3 color channels with integer pixel values between 0 and 255, inclusive. Dividing by 255 rescales these values to the range [0, 1].

In [7]:
images = np.vstack((X_train, X_test))/255.
print(images.shape, np.min(images), np.max(images))

(60000, 32, 32, 3) 0.0 1.0
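Dividing a `uint8` array by the Python float `255.` produces a `float64` array, which for the full 60000 x 32 x 32 x 3 array is roughly 1.47 GB. If memory is a concern, casting to `float32` first halves that. The sketch below uses a small synthetic batch (an assumption, standing in for the real images) to show the difference.

```python
import numpy as np

# Synthetic stand-in for the uint8 image array (assumption: small random batch)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)

scaled64 = fake_images / 255.                     # float64, same approach as the cell above
scaled32 = fake_images.astype(np.float32) / 255.  # float32, half the memory

print(scaled64.dtype, scaled32.dtype)
print(scaled64.nbytes, scaled32.nbytes)
```

Both arrays hold the same values up to `float32` rounding, so the choice mainly affects memory and file size on disk.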


# Combine Downloaded Train/Test Labels into DataFrame
We will use this dataframe to create our open-world setup:
1. Train: known superclasses
1. Val: same known superclasses as in the Train data
1. Test:
    1. known superclasses seen in the Train/Val but unlabeled
    1. novel superclasses not seen in the Train/Val images
    
Think of the open-world problem this way. At training time you have a database of labeled images. Later you accumulate new, unlabeled images and still want predictions for them, knowing that each one may belong either to a class seen during training or to a novel class.

In [8]:
# load_data returns labels with shape (N, 1); flatten each to a plain list
y_train_coarse = [x[0] for x in y_train_coarse]
y_test_coarse = [x[0] for x in y_test_coarse]

y_train_fine = [x[0] for x in y_train_fine]
y_test_fine = [x[0] for x in y_test_fine]

train_labels = pd.DataFrame({'coarse':y_train_coarse,
                             'fine':y_train_fine})

test_labels = pd.DataFrame({'coarse':y_test_coarse,
                            'fine':y_test_fine})

labels = pd.concat([train_labels, test_labels])\
           .reset_index(drop=True)

In [9]:
n_superclasses = len(set(labels.coarse))
n_classes = len(set(labels.fine))
print(n_superclasses, n_classes)

20 100


# Split Data Into Train/Val/Test

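The fraction of superclasses treated as known (and so included in the train/val split) could be a configurable parameter. Here 80% of the superclasses are known and the rest are held out as novel.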

In [10]:
n_train_val_superclasses = int(0.8*(n_superclasses))
n_novel_superclasses = n_superclasses - n_train_val_superclasses
print(n_train_val_superclasses, n_novel_superclasses)

16 4


In [11]:
superclasses = list(set(labels.coarse))

In [12]:
train_val_superclasses = [x for x in np.random.choice(superclasses, n_train_val_superclasses, replace=False)]
novel_superclasses = [x for x in superclasses if x not in train_val_superclasses]
print(train_val_superclasses)
print(novel_superclasses)

[13, 5, 11, 9, 15, 12, 8, 16, 10, 19, 2, 6, 0, 7, 17, 14]
[1, 3, 4, 18]
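The selection above is unseeded, so rerunning the notebook picks a different set of novel superclasses each time. A minimal sketch of a reproducible version, assuming a fixed seed is acceptable (the `seed` value and the `known`/`novel` names are hypothetical), uses NumPy's `default_rng`:

```python
import numpy as np

seed = 42  # hypothetical seed value; any fixed integer works
rng = np.random.default_rng(seed)

superclasses = list(range(20))           # CIFAR-100 has 20 superclasses
n_known = int(0.8 * len(superclasses))   # 16 known, 4 novel

# Sample the known superclasses without replacement; the rest are novel
known = sorted(int(x) for x in rng.choice(superclasses, size=n_known, replace=False))
novel = [x for x in superclasses if x not in known]

print(known)
print(novel)
```

`train_test_split` similarly accepts a `random_state` argument, which would make the later index splits repeatable as well.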


Get indexes of our train/val/test data

In [13]:
train_val_indexes = [x for x in labels[labels['coarse'].isin(train_val_superclasses)].index]
novel_indexes = [x for x in labels[labels['coarse'].isin(novel_superclasses)].index]
print(len(train_val_indexes), len(novel_indexes))

48000 12000


Some of the train_val_indexes must also go into the test data, so that the test set contains images from known superclasses that are treated as unlabeled. The share held out this way, and the share of the remainder used for validation, can also be configurable parameters.

In [14]:
test_unlabeled_perc = 0.20
val_perc = 0.20

In [15]:
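# Hold out a share of the known-superclass images as the unlabeled test population
train_val_indexes, test_unlabeled_indexes = train_test_split(train_val_indexes, test_size=test_unlabeled_perc)
# Split the remaining known-superclass images into train and val
train_indexes, val_indexes = train_test_split(train_val_indexes, test_size=val_perc)
# Test set = held-out known images + all novel-superclass images
test_indexes = test_unlabeled_indexes + novel_indexes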

In [16]:
print(labels.shape[0], len(train_indexes), len(val_indexes), len(test_indexes))

60000 30720 7680 21600
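The sizes printed above follow directly from the split percentages. The short check below reproduces the arithmetic; it assumes the standard CIFAR-100 layout of 3000 images per superclass (5 classes of 600 images each), and the variable names are illustrative only.

```python
# Reproduce the split sizes from the percentages used above
n_superclasses = 20
images_per_superclass = 3000  # 5 classes x 600 images (train + test combined)

n_known_superclasses = int(0.8 * n_superclasses)                  # 16
n_known = n_known_superclasses * images_per_superclass            # 48000
n_novel = (n_superclasses - n_known_superclasses) * images_per_superclass  # 12000

test_unlabeled_perc = 0.20
val_perc = 0.20

n_test_unlabeled = round(test_unlabeled_perc * n_known)  # known images held out for test
n_train_val = n_known - n_test_unlabeled
n_val = round(val_perc * n_train_val)
n_train = n_train_val - n_val
n_test = n_test_unlabeled + n_novel

print(n_train, n_val, n_test)
```

Note that the test set ends up with 9600 known-superclass images and 12000 novel ones, so novel images make up a little over half of it.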


Subset images and labels

In [17]:
X_train = images[train_indexes, :, :, :]
train_labels = labels.iloc[train_indexes, :].values

X_val = images[val_indexes, :, :, :]
val_labels = labels.iloc[val_indexes, :].values

X_test = images[test_indexes, :, :, :]
test_labels = labels.iloc[test_indexes, :].values

In [18]:
print((X_train.shape, train_labels.shape), (X_val.shape, val_labels.shape), (X_test.shape, test_labels.shape))

((30720, 32, 32, 3), (30720, 2)) ((7680, 32, 32, 3), (7680, 2)) ((21600, 32, 32, 3), (21600, 2))


# Write Train/Val/Test Data to Disk

Images

In [19]:
with open(train_val_test_path+'/train.npy', 'wb') as f:
    np.save(f, X_train)
    
with open(train_val_test_path+'/val.npy', 'wb') as f:
    np.save(f, X_val)
    
with open(train_val_test_path+'/test.npy', 'wb') as f:
    np.save(f, X_test)

Labels

In [20]:
with open(train_val_test_path+'/train_labels.npy', 'wb') as f:
    np.save(f, train_labels)
    
with open(train_val_test_path+'/val_labels.npy', 'wb') as f:
    np.save(f, val_labels)
    
with open(train_val_test_path+'/test_labels.npy', 'wb') as f:
    np.save(f, test_labels)
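After writing, it can be worth loading the files back and confirming that each image array has the same number of rows as its label array. The sketch below shows one way to do that; it uses small synthetic arrays and a temporary directory (both assumptions) rather than the real files, and follows the same `<split>.npy` / `<split>_labels.npy` naming as above.

```python
import os
import tempfile
import numpy as np

# Small synthetic stand-ins for the real splits (assumption: only shapes matter here)
splits = {
    'train': (np.zeros((6, 32, 32, 3)), np.zeros((6, 2), dtype=int)),
    'val': (np.zeros((2, 32, 32, 3)), np.zeros((2, 2), dtype=int)),
    'test': (np.zeros((4, 32, 32, 3)), np.zeros((4, 2), dtype=int)),
}

with tempfile.TemporaryDirectory() as tmp:
    # Write images and labels for each split
    for name, (X, y) in splits.items():
        np.save(os.path.join(tmp, name + '.npy'), X)
        np.save(os.path.join(tmp, name + '_labels.npy'), y)

    # Load them back and compare row counts
    checked = {}
    for name in splits:
        X = np.load(os.path.join(tmp, name + '.npy'))
        y = np.load(os.path.join(tmp, name + '_labels.npy'))
        assert X.shape[0] == y.shape[0], name + ' images and labels differ in length'
        checked[name] = (X.shape, y.shape)

print(checked)
```

A check like this fails immediately if the wrong array is saved under a split's filename.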