# Hands-on data engineering Vantage AI
In this session we are going to train a convolutional neural network to classify images from the CIFAR-10 dataset.

## Assignment
We would like this notebook turned into a clean package that follows the cookiecutter template. It should consist of a model training part and a model scoring part (scoring on the test set).

The training part stores a model and metadata on disk.  
The scoring part uses the stored model to make predictions on the test set and shows the results.  
Run with: `python train_model.py {path}` or `python score_model.py {path}`

Use proper error handling, for example when making predictions without a stored model or when a script is run with a faulty argument.
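
The error handling asked for above could look roughly like the sketch below for `score_model.py`. It is a minimal illustration, not a reference solution: the helper names `parse_args`, `find_model` and `main` are hypothetical, and it assumes that `{path}` is a directory containing the `model.h5` file that the notebook saves. Loading the model and scoring are left out on purpose.

```python
import argparse
import sys
from pathlib import Path


def parse_args(argv=None):
    """Parse the single positional {path} argument (argparse exits on a missing one)."""
    parser = argparse.ArgumentParser(description="Score a stored CIFAR-10 model.")
    parser.add_argument("path", type=Path, help="directory that holds the stored model")
    return parser.parse_args(argv)


def find_model(model_dir, filename="model.h5"):
    """Return the path of the stored model, or raise a descriptive error."""
    # Assumption: the training script wrote `filename` into `model_dir`.
    if not model_dir.is_dir():
        raise NotADirectoryError(f"{model_dir} is not an existing directory")
    model_path = model_dir / filename
    if not model_path.is_file():
        raise FileNotFoundError(f"No model found at {model_path}; run train_model.py first")
    return model_path


def main(argv=None):
    """Return a process exit code: 0 on success, 1 on a handled error."""
    args = parse_args(argv)
    try:
        model_path = find_model(args.path)
    except OSError as err:  # covers both errors raised by find_model
        print(f"error: {err}", file=sys.stderr)
        return 1
    print(f"Scoring with model {model_path}")
    # Loading the model and predicting the test set would go here.
    return 0
```

In a real script, `sys.exit(main())` would be called under an `if __name__ == "__main__":` guard, so that a faulty argument results in a non-zero exit code and a readable message instead of a traceback.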


## Dependency management
This notebook expects you to have the following Python dependencies installed:
- TensorFlow (2.0)
- Matplotlib
- scikit-learn (imported as `sklearn`)

Exercise: _Write a `requirements.txt` from which the dependencies of this notebook can be installed easily._

## Load data

The data consist of three parts: train, validation and test set.

Exercise: _There is a lot of repetitive code when loading the data. Split it up into practical and readable code. Think about the engineering principles we discussed during the session._

In [1]:
import numpy as np
import os
import tarfile
from urllib.request import urlretrieve
import pickle
import random


def load_data():
    # training set, batches 1-4
    if not os.path.exists(os.path.join(os.getcwd(), "data")):
        os.makedirs(os.path.join(os.getcwd(), "data"))

        
    dataset_dir = os.path.join(os.getcwd(), "data")
    
    if not os.path.exists(os.path.join(dataset_dir, "cifar-10-batches-py")):
        print("Downloading data...")
        urlretrieve("http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz", os.path.join(dataset_dir, "cifar-10-python.tar.gz"))
        tar = tarfile.open(os.path.join(dataset_dir, "cifar-10-python.tar.gz"))
        tar.extractall(dataset_dir)
        tar.close()
        
    X_train = np.zeros((40000, 32, 32, 3), dtype="float32")
    y_train = np.zeros((40000, 1), dtype="ubyte").flatten()
    n_samples = 10000  # number of samples per batch
    dataset_dir = os.path.join(dataset_dir,"cifar-10-batches-py")
    for i in range(0,4):
        f = open(os.path.join(dataset_dir, "data_batch_"+str(i+1)), "rb")
        cifar_batch = pickle.load(f,encoding="latin1")
        f.close()
        X_train[i*n_samples:(i+1)*n_samples] = (cifar_batch['data'].reshape(-1, 32, 32, 3) / 255.).astype("float32")
        y_train[i*n_samples:(i+1)*n_samples] = np.array(cifar_batch['labels'], dtype='ubyte')

    # validation set, batch 5
    f = open(os.path.join(dataset_dir, "data_batch_5"), "rb")
    cifar_batch_5 = pickle.load(f,encoding="latin1")
    f.close()
    X_val = (cifar_batch_5['data'].reshape(-1, 32, 32, 3) / 255.).astype("float32")
    y_val = np.array(cifar_batch_5['labels'], dtype='ubyte')

    # labels
    f = open(os.path.join(dataset_dir, "batches.meta"), "rb")
    cifar_dict = pickle.load(f,encoding="latin1")
    label_to_names = {k:v for k, v in zip(range(10), cifar_dict['label_names'])}
    f.close()

    # test set
    f = open(os.path.join(dataset_dir, "test_batch"), "rb")
    cifar_test = pickle.load(f,encoding="latin1")
    f.close()
    X_test = (cifar_test['data'].reshape(-1, 32, 32, 3) / 255.).astype("float32")
    y_test = np.array(cifar_test['labels'], dtype='ubyte')


    print("training set size: data = {}, labels = {}".format(X_train.shape, y_train.shape))
    print("validation set size: data = {}, labels = {}".format(X_val.shape, y_val.shape))
    
    print("Test set size: data = "+str(X_test.shape)+", labels = "+str(y_test.shape))

    return X_train, y_train, X_val, y_val, X_test, y_test, label_to_names
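
One possible direction for the exercise above is to pull the repeated open/unpickle/reshape steps into a single helper. The sketch below is only an illustration: `load_batch` and `load_batches` are hypothetical names, and the reshape deliberately mirrors the one used in `load_data()` so that the numbers stay comparable with the outputs shown in this notebook.

```python
import pickle
from pathlib import Path

import numpy as np


def load_batch(path):
    """Load one CIFAR-10 batch file and return (images, labels)."""
    with open(path, "rb") as f:  # the context manager closes the file, also on errors
        batch = pickle.load(f, encoding="latin1")
    # Same reshape and scaling as load_data() in the notebook.
    # Note: the CIFAR-10 python batches store each image as three 1024-value
    # colour planes (red, green, blue), so a channel-correct layout would be
    # reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1). This sketch keeps the
    # notebook's reshape unchanged.
    images = (batch["data"].reshape(-1, 32, 32, 3) / 255.0).astype("float32")
    labels = np.array(batch["labels"], dtype="ubyte")
    return images, labels


def load_batches(dataset_dir, names):
    """Load several batch files and concatenate them along the sample axis."""
    parts = [load_batch(Path(dataset_dir) / name) for name in names]
    images = np.concatenate([x for x, _ in parts])
    labels = np.concatenate([y for _, y in parts])
    return images, labels
```

With helpers like these, the training set would be `load_batches(dataset_dir, ["data_batch_%d" % i for i in range(1, 5)])`, and the validation and test sets would each be a single `load_batch` call.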


### Preprocessing
CIFAR-10 does not require much preprocessing. Normalizing the data is often a good idea: we calculate the mean and standard deviation of the pixel values in advance and normalize the data with those values.

Exercise: _It's a good idea to store the mean and std in a pickle file, which can be used for inference so that we do not need to load the whole dataset._
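
As a starting point for this exercise, the sketch below writes the two statistics to a pickle file and reads them back. The function names `save_stats` and `load_stats` and the dictionary keys are assumptions made for illustration; any file name and layout that the training and scoring scripts agree on would work.

```python
import pickle
from pathlib import Path


def save_stats(path, mean, std):
    """Store the normalization statistics as plain Python floats."""
    with open(path, "wb") as f:
        pickle.dump({"mean": float(mean), "std": float(std)}, f)


def load_stats(path):
    """Read the statistics back, failing with a clear message if they are missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No normalization stats at {path}; run training first")
    with open(path, "rb") as f:
        stats = pickle.load(f)
    return stats["mean"], stats["std"]
```

The scoring script can then call `normalize(X_test, *load_stats(path))` without touching the training data.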

In [2]:
nr_channels = 3
image_size = 32
nr_classes = 10
epochs = 20

X_train, y_train, X_val, y_val, X_test, y_test, label_to_names = load_data()

# Training conv nets takes very long on a CPU, so for now we only use a small part
# of the data; if there is time left you can try running your network on the full set
X_train = X_train[:10000]
y_train = y_train[:10000]

def calc_mean_std(X):
    mean = np.mean(X)
    std = np.std(X)
    return mean, std

def normalize(data, mean, std):
    return (data-mean)/std

# The data in X_train is enough to accurately approximate the mean and std of the whole set
mean,std = calc_mean_std(X_train)
X_test = normalize(X_test,mean,std)
X_val = normalize(X_val,mean,std)
X_train = normalize(X_train ,mean,std)


Downloading data...
training set size: data = (40000, 32, 32, 3), labels = (40000,)
validation set size: data = (10000, 32, 32, 3), labels = (10000,)
Test set size: data = (10000, 32, 32, 3), labels = (10000,)


## Define model
We use a very basic convolutional neural net as our model. Since data science is not the focus of this course, we do not pay much attention to it. It is good to know what is going on though, so don't hesitate to ask questions.

In [3]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten, Conv2D, Dropout, MaxPooling2D, Input
from sklearn.metrics import classification_report

def conv_net():
    input_layer = Input(shape=X_train.shape[1:])
    conv = Conv2D(filters=16, kernel_size=(3,3), padding='valid',
                  data_format='channels_last', activation='relu')(input_layer)
    conv = Conv2D(filters=32, kernel_size=(3,3), padding='valid',
                  data_format='channels_last', activation='relu', strides=(2, 2))(conv)

    flatten = Flatten()(conv)
    output_layer = Dense(units=nr_classes, activation='softmax')(flatten)

    model = Model(inputs=input_layer, outputs=output_layer)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])   
    return model


model = conv_net()

model.fit(x=X_train, y=y_train, batch_size=50, epochs=1, validation_data=(X_val, y_val), verbose=2)
predictions = np.array(model.predict(X_test, batch_size=100))
test_y = np.array(y_test, dtype=np.int32)

#Take the highest prediction
predictions = np.argmax(predictions, axis=1)

# Store model on disk
model.save('model.h5')

#Print results
print("Accuracy = {}".format(np.sum(predictions == y_test) / float(len(predictions))))
print(classification_report(y_test, predictions, target_names=list(label_to_names.values())))

Train on 10000 samples, validate on 10000 samples
10000/10000 - 2s - loss: 1.7349 - accuracy: 0.3864 - val_loss: 1.5719 - val_accuracy: 0.4505
Accuracy = 0.4529
              precision    recall  f1-score   support

    airplane       0.41      0.73      0.52      1000
  automobile       0.57      0.46      0.51      1000
        bird       0.41      0.32      0.36      1000
         cat       0.37      0.20      0.26      1000
        deer       0.40      0.41      0.40      1000
         dog       0.35      0.50      0.41      1000
        frog       0.49      0.53      0.51      1000
       horse       0.46      0.50      0.48      1000
        ship       0.72      0.38      0.50      1000
       truck       0.53      0.50      0.52      1000

    accuracy                           0.45     10000
   macro avg       0.47      0.45      0.45     10000
weighted avg       0.47      0.45      0.45     10000



In [4]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 32, 32, 3)]       0         
_________________________________________________________________
conv2d (Conv2D)              (None, 30, 30, 16)        448       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 32)        4640      
_________________________________________________________________
flatten (Flatten)            (None, 6272)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                62730     
Total params: 67,818
Trainable params: 67,818
Non-trainable params: 0
_________________________________________________________________
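
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below redoes the arithmetic for the two `padding='valid'` convolutions and the dense layer; the helper names are made up for this illustration and nothing here depends on TensorFlow.

```python
def conv_output_size(size, kernel, stride=1):
    """Spatial output size of a 'valid' convolution (no padding)."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size1 = conv_output_size(32, kernel=3)               # first Conv2D: 32 -> 30
size2 = conv_output_size(size1, kernel=3, stride=2)  # second Conv2D, strides=(2, 2): 30 -> 14
flat = size2 * size2 * 32                            # Flatten: 14 * 14 * 32 = 6272

params = [
    conv_params(3, in_channels=3, filters=16),   # conv2d:   448
    conv_params(3, in_channels=16, filters=32),  # conv2d_1: 4640
    flat * 10 + 10,                              # dense:    62730 (weights + biases)
]
print(size1, size2, flat, params, sum(params))
# prints: 30 14 6272 [448, 4640, 62730] 67818
```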
