# Hands-on data engineering Vantage AI
In this session we train a convolutional neural network to classify images from the CIFAR-10 dataset.

## Assignment
We would like this notebook turned into a clean package that follows the cookiecutter template. The package should consist of a model training part and a model scoring part (scoring on the test set).

The training part stores a model and metadata on disk.  
The scoring part uses the stored model to predict on the test set and show the results.  
Run with: `python train_model.py {path}` or `python score_model.py {path}`
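
As a rough sketch of what "a model and metadata on disk" could look like, the snippet below writes a small JSON metadata file into the model directory and reads it back. The file name `metadata.json`, the field names, and the helpers `save_metadata` and `load_metadata` are hypothetical and not taken from the original project; saving the model itself is left out.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_metadata(model_dir, metrics, params):
    """Write training metadata as JSON into model_dir and return the file path."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,  # e.g. final validation accuracy
        "params": params,    # e.g. number of epochs
    }
    path = model_dir / "metadata.json"  # hypothetical file name
    path.write_text(json.dumps(metadata, indent=2))
    return path


def load_metadata(model_dir):
    """Read the metadata back, failing with a clear message if it is missing."""
    path = Path(model_dir) / "metadata.json"
    if not path.is_file():
        raise FileNotFoundError(f"No metadata found at {path}; train a model first.")
    return json.loads(path.read_text())
```

Keeping the metadata in a plain JSON file next to the model makes it easy for the scoring part to report which training run it is evaluating.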

Use proper error handling, for example when making predictions without a stored model or when the script is run with an invalid argument.
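
One possible shape for the scoring entry point is sketched below. It is only an illustration: the helper names `parse_args`, `check_model_dir`, and `main` are assumptions, and the check "the directory exists and is not empty" stands in for whatever the real project uses to detect a stored model.

```python
import argparse
import sys
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Score a stored model on the test set.")
    parser.add_argument("path", type=Path, help="directory containing the stored model")
    return parser.parse_args(argv)


def check_model_dir(path):
    """Return path if it is an existing, non-empty directory, else raise."""
    if not path.is_dir():
        raise FileNotFoundError(f"Model directory not found: {path}")
    if not any(path.iterdir()):
        raise FileNotFoundError(f"No stored model in {path}; run train_model.py first.")
    return path


def main(argv=None):
    # argparse itself exits with status 2 on a missing or unknown argument
    args = parse_args(argv)
    try:
        model_dir = check_model_dir(args.path)
    except FileNotFoundError as err:
        print(f"error: {err}", file=sys.stderr)
        return 1
    print(f"Scoring with model from {model_dir}")  # placeholder for the real scoring call
    return 0

# In score_model.py the file would end with: sys.exit(main())
```

Returning an exit code from `main` instead of raising keeps the failure message short for the user and still lets shell scripts detect the error.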


## Dependency management
This notebook expects the following Python dependencies to be installed:
- TensorFlow (2.0)
- Matplotlib
- scikit-learn
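
A quick way to check for these packages before running the notebook is sketched here. The helper `missing_dependencies` is hypothetical; the import names `tensorflow`, `matplotlib`, and `sklearn` are the ones these packages are imported under.

```python
import importlib.util

# Display name -> import name
REQUIRED = {
    "TensorFlow": "tensorflow",
    "Matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
}


def missing_dependencies(required=REQUIRED):
    """Return the display names of packages whose import name cannot be found."""
    return [name for name, module in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_dependencies()
if missing:
    print("Missing dependencies:", ", ".join(missing))
```

This only checks that the packages are importable, not that the installed versions match.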

## Load data

The data consists of three parts: a training set, a validation set, and a test set.
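
CIFAR-10 ships with 50,000 training images and 10,000 test images, so a validation set has to be held out from the training images. How `generate_train_test_val_dataset` does this is not shown here; the sketch below is one simple way, using a hypothetical helper `split_indices` that returns shuffled index lists which can then be used to index the image and label arrays.

```python
import random


def split_indices(n_samples, n_val, seed=0):
    """Shuffle range(n_samples) and return (train_idx, val_idx) with n_val held out."""
    if not 0 < n_val < n_samples:
        raise ValueError("n_val must be between 1 and n_samples - 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(50_000, 10_000)
print(len(train_idx), len(val_idx))  # 40000 10000
```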

In [1]:
%load_ext autoreload
%autoreload 2
from pathlib import Path
import sys
import os
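# Remember the notebook's directory so that re-running this cell gives the same result
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
# The project root is two levels above the notebook directory
project_dir = Path(workbookDir).resolve().parents[1]
sys.path.append(str(project_dir))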
from src.data import make_dataset
from src.data.retrieve_data import load_data
from src.data.generate_dataset import generate_train_test_val_dataset
from src.models.train_model import *
from src.models.predict_model import *
from src.models.evaluate_model import *

In [3]:
# data_url = "http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
# load_data(data_url)
# generate_train_test_val_dataset()

2020-03-27 15:34:12.872 | INFO     | src.data.retrieve_data:load_data:17 - Downloading data...
2020-03-27 15:34:27.779 | INFO     | src.data.retrieve_data:load_data:24 - Downloaded data, removing zip file


training set size: data = (40000, 32, 32, 3), labels = (40000,)
validation set size: data = (10000, 32, 32, 3), labels = (10000,)
Test set size: data = (10000, 32, 32, 3), labels = (10000,)


(array([[[[0.23137255, 0.16862746, 0.19607843],
          [0.26666668, 0.38431373, 0.46666667],
          [0.54509807, 0.5686275 , 0.58431375],
          ...,
...

## Train model

In [4]:
train_model()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 32, 32, 3)]       0         
_________________________________________________________________
conv2d (Conv2D)              (None, 30, 30, 16)        448       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 32)        4640      
_________________________________________________________________
flatten (Flatten)            (None, 6272)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                62730     
Total params: 67,818
Trainable params: 67,818
Non-trainable params: 0
_________________________________________________________________
None
Train on 10000 samples, validate on 10000 samples
Epoch 1/20
10000/10000 - 2s - loss: 1.7467 - accuracy: 0.3869 - val_
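
The output shapes and parameter counts in the model summary can be reproduced with a little arithmetic. The sketch below assumes 3x3 kernels without padding and a stride of 2 in the second convolution; these settings are inferred from the numbers in the summary and are not stated in the notebook. The helper names are hypothetical.

```python
def conv2d_output_size(size, kernel, stride=1):
    """Spatial output size of a convolution without padding ('valid')."""
    return (size - kernel) // stride + 1


def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter, for a square kernel."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per output unit."""
    return (inputs + 1) * units


s1 = conv2d_output_size(32, kernel=3)            # 30, matches conv2d
s2 = conv2d_output_size(s1, kernel=3, stride=2)  # 14, matches conv2d_1
flat = s2 * s2 * 32                              # 6272, matches flatten
total = (conv2d_params(3, 3, 16)                 # 448
         + conv2d_params(3, 16, 32)              # 4640
         + dense_params(flat, 10))               # 62730
print(s1, s2, flat, total)  # 30 14 6272 67818
```

Almost all of the parameters (62,730 of 67,818) sit in the final dense layer, because it connects every one of the 6,272 flattened features to each of the 10 classes.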

## Predict and evaluate model

In [2]:
split = 'test'
X = f"X_{split}"  # name of the stored feature set, here "X_test"
y = f"y_{split}"  # name of the stored label set, here "y_test"
predictions = predict(X)
evaluate_model(predictions, y)

Accuracy = 0.496
              precision    recall  f1-score   support

    airplane       0.53      0.58      0.56      1000
  automobile       0.61      0.56      0.59      1000
        bird       0.39      0.40      0.40      1000
         cat       0.33      0.33      0.33      1000
        deer       0.45      0.38      0.41      1000
         dog       0.41      0.42      0.42      1000
        frog       0.55      0.60      0.57      1000
       horse       0.58      0.56      0.57      1000
        ship       0.58      0.62      0.60      1000
       truck       0.53      0.49      0.51      1000

    accuracy                           0.50     10000
   macro avg       0.50      0.50      0.50     10000
weighted avg       0.50      0.50      0.50     10000
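
To make the columns of the report concrete, here is a minimal sketch of how precision, recall, and F1 per class can be computed from true and predicted labels. It is not the implementation behind `evaluate_model`; the helper `per_class_report` and the toy labels are made up for illustration.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for two label sequences."""
    true_count, pred_count = Counter(y_true), Counter(y_pred)
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    report = {}
    for label in sorted(true_count | pred_count):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1, true_count[label])
    return report


y_true = ["cat", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog"]
for label, (p, r, f1, support) in per_class_report(y_true, y_pred).items():
    print(f"{label:>5}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
```

Because every class in the test set has the same support (1000), the macro and weighted averages in the report above coincide.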

