## Predicting flat (tabular) data

Each pattern is composed of a fixed number of features (converted into numbers!).  
That's why we call it **tabular**: each pattern is one row of features, so the whole dataset can be seen as a 2D table (flat).
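
As a minimal sketch of what "converted into numbers" can mean for a categorical feature, the snippet below one-hot encodes a made-up `color` column in plain Python. The feature name and its values are hypothetical and do not come from any dataset used in this notebook.

```python
# Hypothetical categorical feature: one value per pattern.
colors = ["red", "green", "blue", "green"]

# Fix an ordering of the categories so every pattern gets a vector of the same length.
categories = sorted(set(colors))

def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

# Each pattern becomes one numeric row of the table.
encoded = [one_hot(c, categories) for c in colors]
for row in encoded:
    print(row)
```

Every pattern now has the same number of numeric features, which is what a dense network expects as input.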

## Exercise: Boston Housing price regression dataset

Very popular toy dataset: https://keras.io/api/datasets/boston_housing/  
Already provided by Keras for you.

Try to tackle this problem with a Deep Learning model built with Keras. 

**Objectives**: 
*   make sure you are able to build a deep learning model 
*   make sure you can code a full training loop with a bit of hyperparameter search (model selection vs. model assessment)
*   make sure you are able to monitor training progress
*   make sure you can apply a minimum level of preprocessing (try to explore the dataset and see what kind of features you are dealing with)
*   make sure you can evaluate your model on unseen data
*   make sure you are able to save your model and load it back

This will be needed when moving on to more complex tasks on structured data such as images, sequences, etc. You can reuse the code and then focus on the specific task instead of the *boilerplate* part, which is always the same.

**Optional**: feel free to explore other flat datasets!  
TF datasets: https://www.tensorflow.org/datasets/catalog/overview#all_datasets  
UCI datasets: https://archive.ics.uci.edu/ml/index.php

For example (from UCI): 

HIGGS (~2.6 GB) https://archive.ics.uci.edu/ml/datasets/HIGGS  
Binary classification task to distinguish between a signal process which produces Higgs bosons and a background process which does not.  
**Large dataset**: 11 000 000 patterns, 28 real-valued features

Bank Marketing Dataset (use the `bank-full.csv` file for patterns) https://archive.ics.uci.edu/ml/datasets/Bank+Marketing  
Binary classification task to predict if a client will subscribe to a term deposit.  
It has 45211 patterns with 17 features.

In [None]:
import tensorflow.keras as K
import tensorflow as tf
import numpy as np
from tensorflow.keras.datasets import boston_housing

In [None]:
# load_data returns 2 tuples, each containing 2 NumPy arrays
(x, y), (x_test, y_test) = boston_housing.load_data(test_split=0.15)
print(x.shape, y.shape, x_test.shape, y_test.shape)
print(x.dtype, y.dtype, x_test.dtype, y_test.dtype)

Quite a small dataset, but it's enough for us to get started.

Do your best!

## Solution example

Split your dataset!!

In [None]:
from sklearn.model_selection import train_test_split
VAL_SPLIT = 0.15

# why don't we use stratify here?
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=VAL_SPLIT, shuffle=True)

Normalize your data!

In [None]:
print(np.mean(x_train), np.var(x_train))

In [None]:
# Older TensorFlow releases expose this layer under tensorflow.keras.layers.experimental.preprocessing
from tensorflow.keras.layers import Normalization
norm_layer = Normalization(axis=-1)
norm_layer.adapt(x_train)
normalized_x_train, normalized_x_val, normalized_x_test = norm_layer(x_train), norm_layer(x_val), norm_layer(x_test)

In [None]:
print(np.mean(normalized_x_train), np.var(normalized_x_train))
# the statistics come from the training set, so the validation set is only approximately standardized
print(np.mean(normalized_x_val), np.var(normalized_x_val))
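
What the normalization step does can be sketched roughly in NumPy: compute mean and standard deviation per feature on the training set only, then reuse them for every other split. The arrays below are synthetic stand-ins (names ending in `_demo` are assumptions for illustration), and this is not a claim about the exact internals of the Keras layer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for x_train / x_val: 13 features on very different scales.
scales = np.linspace(1.0, 100.0, 13)
x_train_demo = rng.normal(loc=5.0, scale=scales, size=(200, 13))
x_val_demo = rng.normal(loc=5.0, scale=scales, size=(40, 13))

# Statistics are computed per feature (axis=0) on the training set only.
mean = x_train_demo.mean(axis=0)
std = x_train_demo.std(axis=0)

def standardize(x):
    # Reuse the training statistics for every split to avoid leaking information.
    return (x - mean) / std

z_train = standardize(x_train_demo)
z_val = standardize(x_val_demo)
print(z_train.mean(axis=0).round(3), z_train.var(axis=0).round(3))
print(z_val.mean(axis=0).round(3), z_val.var(axis=0).round(3))
```

The training split ends up with zero mean and unit variance per feature, while the validation split is only close to that, because its own statistics differ slightly from the training ones.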

Build the model

In [None]:
def get_compiled_model(hidden_size, learning_rate):
  optimizer = K.optimizers.SGD(learning_rate=learning_rate, momentum=0.9)
  model = K.Sequential()
  model.add(K.layers.Input(shape=(13,)))
  model.add(K.layers.Dense(hidden_size, activation='tanh'))
  model.add(K.layers.Dense(1))
  model.compile(optimizer=optimizer, loss='mse')
  return model

In [None]:
model = get_compiled_model(256, 1e-4)

In [None]:
model.fit(normalized_x_train, y_train, epochs=10, batch_size=10)
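
One possible way to monitor training progress is to pass `validation_data` to `fit` and inspect the returned `History` object. The sketch below uses a small synthetic regression problem (the `_demo` names and the tiny architecture are assumptions made for illustration), so it runs independently of the cells above.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# Synthetic data with 13 features, mimicking the input shape used in this notebook.
x_demo = rng.normal(size=(64, 13)).astype("float32")
y_demo = x_demo.sum(axis=1, keepdims=True)

model_demo = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(13,)),
    tf.keras.layers.Dense(8, activation="tanh"),
    tf.keras.layers.Dense(1),
])
model_demo.compile(optimizer="sgd", loss="mse")

# validation_data makes Keras compute a validation loss at the end of every epoch.
history = model_demo.fit(
    x_demo[:48], y_demo[:48],
    validation_data=(x_demo[48:], y_demo[48:]),
    epochs=3, batch_size=8, verbose=0,
)

# history.history maps each metric name to a list with one value per epoch.
losses = zip(history.history["loss"], history.history["val_loss"])
for epoch, (train_loss, val_loss) in enumerate(losses, start=1):
    print(f"epoch {epoch}: train loss {train_loss:.3f}, val loss {val_loss:.3f}")
```

In this notebook the same idea would amount to adding `validation_data=(normalized_x_val, y_val)` to the `fit` call.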

In [None]:
metrics = model.evaluate(normalized_x_val, y_val)
print(metrics)
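
Since the model is compiled with `loss='mse'` and the targets are median house values in thousands of dollars, the raw loss is hard to read. A small helper (hypothetical, not part of Keras) can convert it into a root mean squared error expressed in dollars:

```python
import math

def rmse_in_dollars(mse, unit=1000.0):
    """Convert an MSE computed on targets in thousands of dollars into an RMSE in dollars."""
    return math.sqrt(mse) * unit

# An MSE of 25 means predictions are off by about 5 (thousand dollars) in the RMSE sense.
print(rmse_in_dollars(25.0))  # 5000.0
```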

## Grid search for model selection

In [None]:
from sklearn.model_selection import ParameterGrid

# 6 configurations total
choices = {"learning_rate": [1e-2, 1e-3, 1e-4], "hidden_size": [128, 256]}
grid = ParameterGrid(choices)
print(len(list(grid)), list(grid))
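
To see where the 6 configurations come from, the grid can be reproduced with the standard library as a Cartesian product of the candidate values. This is only a sketch of the idea, and the order of the resulting configurations is not guaranteed to match the one produced by `ParameterGrid`.

```python
from itertools import product

choices = {"learning_rate": [1e-2, 1e-3, 1e-4], "hidden_size": [128, 256]}

# Fix the key order, then combine every value of one key with every value of the other.
names = sorted(choices)
configs = [
    dict(zip(names, values))
    for values in product(*(choices[name] for name in names))
]

print(len(configs))  # 3 learning rates x 2 hidden sizes = 6
for conf in configs:
    print(conf)
```

The number of configurations is the product of the list lengths, so adding a third hyperparameter with 4 values would already give 24 training runs.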

In [None]:
best_loss = float("inf")  # any real validation loss will be lower than this
best_conf = None
for el in grid:
  print("Training with configuration: ", el["learning_rate"], el["hidden_size"])
  model = get_compiled_model(el["hidden_size"], el["learning_rate"])
  model.fit(normalized_x_train, y_train, epochs=10, batch_size=10, verbose=0)
  loss = model.evaluate(normalized_x_val, y_val)
  print("Loss: ", loss)
  if loss < best_loss:
    print("Found better configuration")
    best_loss = loss
    best_conf = el
print(best_loss)
print(best_conf)

## Model assessment

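With the best configuration selected, you can retrain on the union of training and validation data, then evaluate once on the test set, which was never used during model selection.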

In [None]:
normalized_x = norm_layer(x)
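# use the selected configuration for both hyperparameters (not the last one visited by the loop)
model = get_compiled_model(best_conf["hidden_size"], best_conf["learning_rate"])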
model.fit(normalized_x, y, epochs=10, batch_size=10)
loss = model.evaluate(normalized_x_test, y_test)
print("Final loss: ", loss)

model.save('model.h5')
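
Loading the model back could look like the sketch below. It uses a tiny stand-in model and a temporary directory so that it runs on its own (both are assumptions made for illustration), and it passes `compile=False` because only the architecture and weights are needed for inference.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Tiny stand-in for the trained model, with the same 13-feature input.
demo = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(13,)),
    tf.keras.layers.Dense(1),
])
x_demo = np.ones((4, 13), dtype="float32")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.h5")
    demo.save(path)
    # compile=False skips restoring the optimizer and loss; predictions do not need them.
    restored = tf.keras.models.load_model(path, compile=False)

# The restored model should reproduce the predictions of the original one.
same = np.allclose(
    demo.predict(x_demo, verbose=0),
    restored.predict(x_demo, verbose=0),
)
print(same)
```

For the notebook's own file, the equivalent call would be `K.models.load_model('model.h5')`, followed by a check on `normalized_x_test` that the predictions match.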