# Lab 2: Data Engineering
## Exercise 1: Create a dataset

Until now, we have used the CIFAR-10 dataset from Keras. Every script started by preprocessing this dataset into a subset with the four classes that we chose. Because CIFAR-10 is quite small, this doesn't take much time, so it didn't bother us.

Imagine you have a gigantic dataset. This is quite normal when you are doing machine learning in the real world. Preprocessing it takes a long time, so we really don't want to do it for every training experiment. What's the solution? ClearML Datasets!

Let's create a dataset in ClearML for our subset of CIFAR-10. We do this with a script that preprocesses the data, writes it to the file system, and then uploads the files to ClearML. We need to write the data to files first because ClearML datasets are always created from files. You can write any Python object to a file using `pickle.dump()`; for NumPy arrays, use `numpy.save()`.
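As a minimal sketch of writing data to files, the snippet below saves a NumPy array with `numpy.save()` and a plain Python object with `pickle.dump()`, then reads both back. The array contents, the file names, and the temporary output directory are placeholders chosen for illustration, not part of the lab.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical output folder; in the lab you can use any directory you like.
out_dir = Path(tempfile.mkdtemp())

# Placeholder data standing in for the preprocessed images and some metadata.
example_images = np.zeros((10, 32, 32, 3), dtype=np.float32)
example_meta = {"class_names": ["bird", "cat", "deer", "dog"]}

# NumPy arrays: np.save() writes a binary .npy file.
np.save(out_dir / "example_images.npy", example_images)

# Any other Python object: pickle.dump() needs a file opened in binary mode.
with open(out_dir / "example_meta.pkl", "wb") as f:
    pickle.dump(example_meta, f)

# Read both files back to check that nothing was lost.
loaded_images = np.load(out_dir / "example_images.npy")
with open(out_dir / "example_meta.pkl", "rb") as f:
    loaded_meta = pickle.load(f)

print(loaded_images.shape, loaded_meta)
```

Only unpickle files that you created yourself or otherwise trust, because `pickle.load()` can execute arbitrary code.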

To create a ClearML dataset, use `Dataset.create()`:

https://clear.ml/docs/latest/docs/references/sdk/dataset#datasetcreate \
https://clear.ml/docs/latest/docs/clearml_data/data_management_examples/data_man_python

Afterward, go to the ClearML website, click on DATASETS on the left side, and find the dataset you just created.

We won't do this now, but later we can use [Dataset.get()](https://clear.ml/docs/latest/docs/references/sdk/dataset/#datasetget) to download our new dataset before training.
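For a rough idea of what that later step could look like, here is a hedged sketch. It assumes the dataset contains a file named `train_images.npy` written with `numpy.save()`, and the project and dataset names in the comments are hypothetical. The helper `fetch_dataset_dir` is only defined here, not called, because it needs ClearML credentials; the part that actually runs uses a temporary folder as a stand-in for the downloaded copy.

```python
import tempfile
from pathlib import Path

import numpy as np


def fetch_dataset_dir(dataset_project, dataset_name):
    """Download a ClearML dataset and return the path of the local copy."""
    # Imported here so load_array() below can be used without ClearML installed.
    from clearml import Dataset

    dataset = Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name)
    # get_local_copy() returns the path of a cached, read-only folder.
    return dataset.get_local_copy()


def load_array(dataset_dir, file_name):
    """Load one array that was written with np.save()."""
    return np.load(Path(dataset_dir) / file_name)


# Offline stand-in for a downloaded dataset folder (placeholder data).
demo_dir = Path(tempfile.mkdtemp())
np.save(demo_dir / "train_images.npy", np.ones((8, 32, 32, 3)))
demo_images = load_array(demo_dir, "train_images.npy")
print(demo_images.shape)

# With credentials configured, the real calls might look like this
# (project and dataset names are hypothetical):
# dataset_dir = fetch_dataset_dir("Lab 2", "cifar10-subset")
# train_images = load_array(dataset_dir, "train_images.npy")
```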

In [None]:
!pip install -q clearml nbconvert

%env CLEARML_WEB_HOST=https://app.clear.ml
%env CLEARML_API_HOST=https://api.clear.ml
%env CLEARML_FILES_HOST=https://files.clear.ml
%env CLEARML_API_ACCESS_KEY=<your key here>
%env CLEARML_API_SECRET_KEY=<your key here>
# TODO fill in your keys    ^^^^^^^^^^^^^^^

import tensorflow.keras as keras
import matplotlib.pyplot as plt
import numpy as np
from clearml import Dataset

# download the dataset
(images, labels), _ = keras.datasets.cifar10.load_data()

# there are 10 classes of images
all_classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# choose four classes (feel free to change this!)
class_names = ["bird", "cat", "deer", "dog"]
print("Class names:", class_names)

# only keep images of these classes
# (CIFAR-10 labels have shape (N, 1), so flatten them to plain integers first)
labels = labels.flatten()
class_indexes = [all_classes.index(c) for c in class_names]
to_keep = np.isin(labels, class_indexes)
images = images[to_keep]
labels = labels[to_keep]

# remap labels from CIFAR-10 indexes (0-9) to indexes into class_names (0-3)
labels = np.array([class_indexes.index(l) for l in labels])

# normalize pixels between 0 and 1
images = images / 255.0

# ---
# TMP Only store 10 images/labels, because we want fast upload.
# This way we can learn the dataset SDK without waiting too long.
images = images[:10]
labels = labels[:10]
# ---

# split into train and test set
split = round(len(images) * 0.8)
train_images = images[:split]
train_labels = labels[:split]
test_images = images[split:]
test_labels = labels[split:]
print("Number of train images:", len(train_images))
print("Number of test images:", len(test_images))

# TODO save numpy arrays to disk
np.save(...)
np.save(...)
np.save(...)
np.save(...)

# TODO create ClearML dataset
dataset = Dataset.create(...)
dataset.add_files(...)
dataset.add_files(...)
dataset.add_files(...)
dataset.add_files(...)

dataset.upload()
dataset.finalize()