1. Data Pre-Processing
1. NN Model
1. [Validation Set](#valset)
1. [Manual Hyperparameter Tuning](#manual)
    1. [`learning_rate`](#lr)
    1. [`batch_size`](#bs)
    1. [`epochs`](#epochs)
    1. [Changing the Model](#layers)
1. [Automated Hyperparameter Tuning](#auto)
    1. [GridSearch](#gs)
    1. [RandomSearch](#rs)
    1. [Regularization: `Dropout`](#reg)
    1. [Baselines](#baselines)

In [67]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# import data
df = pd.read_csv('insurance.csv')

# split Xs and Ys
X = df.iloc[:, 0:6]
y = df.iloc[:, -1]

# one-hot encode categorical values
X = pd.get_dummies(X)

# split train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [8]:
# from sklearn.preprocessing import Normalizer
# from sklearn.compose import ColumnTransformer

# ct = ColumnTransformer([('normalize', Normalizer(), ['age', 'bmi', 'children'])], remainder='passthrough')

# X_train_norm = ct.fit_transform(X_train)
# X_test_norm = ct.transform(X_test)

# X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
# X_train_norm.head()

In [69]:
from sklearn.compose import ColumnTransformer

# instantiate CT
ct = ColumnTransformer([('standardize', StandardScaler(), ['age', 'bmi', 'children'])], remainder='passthrough')

# standardize numerical vars (zero mean, unit variance)
X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns = X_train.columns)
X_train_scaled.head()

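|   | age | bmi | children | sex_female | sex_male | smoker_no | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
|---|-----|-----|----------|------------|----------|-----------|------------|------------------|------------------|------------------|------------------|
| 0 | 0.265106 | -0.913375 | -0.912607 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.0165 | 0.795456 | 0.747689 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.405909 | -0.007962 | -0.082459 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | -1.424533 | 0.394165 | -0.912607 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.461934 | 1.564598 | -0.912607 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |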
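
As a rough illustration of what `StandardScaler` does inside the `ColumnTransformer`, the sketch below standardizes a single numeric column by hand with NumPy. The ages are made-up values and `fit_standardizer` / `standardize` are hypothetical helper names. The point it tries to show is that the mean and standard deviation come from the training data only and are then reused on the test data.

```python
import numpy as np

def fit_standardizer(train_column):
    """Return the mean and population standard deviation of the training column."""
    # np.std defaults to ddof=0, which matches StandardScaler
    return train_column.mean(), train_column.std()

def standardize(column, mean, std):
    """Shift to zero mean and scale to unit variance using the given statistics."""
    return (column - mean) / std

# made-up ages, for illustration only
age_train = np.array([19.0, 28.0, 33.0, 45.0, 60.0])
age_test = np.array([25.0, 52.0])

mean, std = fit_standardizer(age_train)
age_train_scaled = standardize(age_train, mean, std)
age_test_scaled = standardize(age_test, mean, std)  # reuse the training statistics

print(age_train_scaled.round(3))
print(age_test_scaled.round(3))
```

Fitting on the training column and only transforming the test column mirrors the `fit_transform` / `transform` pair used in the cell above.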


In [28]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense
from tensorflow.keras.optimizers import Adam

# instantiate model
model = Sequential(name='my_model')

# instantiate & add input layer to the model
input_layer = InputLayer(input_shape=(X.shape[1], ))
model.add(input_layer)

# instantiate & add hidden layers
model.add(Dense(128, activation='relu'))

# instantiate & add output layer
model.add(Dense(1))

# check model
print(model.summary())

# compile model
model.compile(loss='mse', metrics=['mae'], optimizer=Adam(learning_rate=0.001))

# train model
model.fit(X_train_scaled, y_train, epochs=50, batch_size=3, verbose=1)

# evaluate model
val_mse, val_mae = model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"MSE: {val_mse}\nMAE: {val_mae}")

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_5 (Dense)             (None, 128)               1536      
                                                                 
 dense_6 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,665
Trainable params: 1,665
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50
...
Epoch 38/50
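
The parameter counts in the model summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is ours, and the input width of 11 is the number of columns after one-hot encoding.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 11  # columns after one-hot encoding

hidden = dense_params(n_features, 128)  # hidden layer with 128 units
output = dense_params(128, 1)           # single regression output

print(hidden, output, hidden + output)  # 1536 129 1665
```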


<a name="valset"></a>
# 1. Validation Set

Using the training data to choose hyperparameters can lead to __overfitting__: the model learns patterns specific to the training data that do not carry over to new data.

<img src="https://content.codecademy.com/courses/deeplearning-with-tensorflow/hyperparameter-tuning/hyperparameter-tuning-diagram.png" alt="train_pipeline" style="width: 65%;"/>

For that reason, hyperparameters are chosen on a held-out set called the __validation set__. In TensorFlow Keras, the validation split can be specified with the `validation_split` parameter of the `.fit()` method.

In [13]:
# compile model
model.compile(loss='mse', metrics=['mae'], optimizer=Adam(learning_rate=0.001))

# train model
model.fit(X_train_scaled, y_train, epochs=50, batch_size=10, verbose=0, validation_split=0.2)

# evaluate model
val_mse, val_mae = model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"MSE: {val_mse}\nMAE: {val_mae}")

MSE: 20626376.0
MAE: 2698.529052734375
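
A minimal sketch of what `validation_split=0.2` does with array inputs: Keras holds out the last 20% of the samples passed to `.fit()`, before any shuffling, and trains on the rest. The helper `holdout_split` is a hypothetical stand-in written with NumPy, not the Keras implementation.

```python
import math
import numpy as np

def holdout_split(X, y, validation_split):
    """Split off the last `validation_split` fraction of the rows as a validation set."""
    split_at = int(math.floor(len(X) * (1.0 - validation_split)))
    return (X[:split_at], y[:split_at]), (X[split_at:], y[split_at:])

# toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

(X_tr, y_tr), (X_val, y_val) = holdout_split(X_toy, y_toy, validation_split=0.2)
print(len(X_tr), len(X_val))  # 8 2
print(y_val)                  # [8 9]
```

Because the split is positional, the hold-out is only representative if the rows are not ordered. Here `train_test_split` has already shuffled them.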


<a name="manual"></a>
# 2. Manual Hyperparameter Tuning

<a name='lr'></a>
## 2.1 Learning Rate

A __larger__ `learning_rate` leads to a __faster learning process__, at the __risk of getting stuck in a suboptimal solution (local minimum)__. A __smaller__ `learning_rate` might produce a __good suboptimal or global solution__, but it will take __much longer to converge__. In the extremes, a `learning_rate` that is __too large__ leads to an __unstable learning process that oscillates__ over the epochs, while one that is __too small__ may __not converge in the available epochs or may get stuck__ in a local minimum.
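
The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which has a single minimum at 0, so it only illustrates speed and stability, not local minima. The function name `gradient_descent` and the three learning rates are illustrative choices.

```python
def gradient_descent(learning_rate, x0=5.0, steps=20):
    """Minimize f(x) = x**2 with plain gradient descent and return the visited points."""
    x = x0
    path = [x]
    for _ in range(steps):
        gradient = 2 * x                  # derivative of x**2
        x = x - learning_rate * gradient  # gradient descent update
        path.append(x)
    return path

for lr in (0.01, 0.4, 1.1):
    final_x = gradient_descent(lr)[-1]
    print(f"learning_rate={lr}: x after 20 steps = {final_x:.4g}")
```

With the smallest rate, x is still far from 0 after 20 steps. The middle rate gets very close to 0. The largest rate overshoots on every step, flipping sign and growing in magnitude.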

<a name='bs'></a>
## 2.2 Batch Size

The `batch_size` is a hyperparameter that determines __how many training samples are seen before updating the network’s parameters__ (weight and bias matrices).

When the batch contains __all the training examples__, the process is called __batch gradient descent__. If the batch has __one sample__, it is called __stochastic gradient descent__. Finally, when __1 < `batch_size` < number of training points__, it is called __mini-batch gradient descent__. An advantage of using batches is that GPUs can parallelize the neural network computations across the samples in a batch.
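
The three regimes differ in how many parameter updates happen per pass over the data. A small sketch, assuming a hypothetical training set of 1000 samples and a helper name of our own:

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one pass over the data (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

n_samples = 1000  # hypothetical training-set size

print(updates_per_epoch(n_samples, batch_size=n_samples))  # batch gradient descent: 1
print(updates_per_epoch(n_samples, batch_size=1))          # stochastic gradient descent: 1000
print(updates_per_epoch(n_samples, batch_size=32))         # mini-batch gradient descent: 32
```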

A __larger__ `batch_size` gives our model __better gradient estimates__ and a solution close to the optimum, but this comes __at a cost to computational efficiency and generalization__ performance. A __smaller__ `batch_size` gives a __poorer estimate of the gradient__, but each update is computed __faster__.
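
Why larger batches give better gradient estimates can be sketched with a simulation. Below, made-up per-sample "gradients" are averaged over random batches of different sizes, and the spread of those averages is compared. The data, the helper name `estimate_spread`, and the batch sizes are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up per-sample "gradients" scattered around a true mean gradient of 1.0
per_sample_gradients = rng.normal(loc=1.0, scale=1.0, size=10_000)

def estimate_spread(batch_size, n_batches=500):
    """Standard deviation of the mini-batch mean gradient across many random batches."""
    batches = rng.choice(per_sample_gradients, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

spread = {b: estimate_spread(b) for b in (4, 64, 1024)}
for batch_size, value in spread.items():
    print(f"batch_size={batch_size}: spread of gradient estimate = {value:.3f}")
```

The spread shrinks roughly with the square root of the batch size, so each larger batch yields a less noisy estimate at a higher cost per update.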

Finding the __“sweet spot”__ depends on the dataset and the problem, and can be determined through hyperparameter tuning.

When using a __larger__ `batch_size` it is usually good to __increase__ `learning_rate`.

<a name='epochs'></a>
## 2.3 Epochs

`epochs` is a hyperparameter representing the number of complete passes through the training dataset. This is typically a large number (100, 1000, or larger). If the data is split into batches, __in one epoch the optimizer will see all the batches__.

__Too many__ epochs can lead to __overfitting__, and __too few__ to __underfitting__. One trick is to use `EarlyStopping`: when the monitored metric (here, the validation loss) reaches a plateau or starts degrading, training stops.

In [19]:
from tensorflow.keras.callbacks import EarlyStopping

# instantiate EarlyStopping
es = EarlyStopping(
    # monitor validation loss
    monitor='val_loss',
    # we seek minimal loss
    mode='min', verbose=1,
    # if it plateaus, continue for 40 more epochs in case it improves afterwards
    patience=40)

history = model.fit(X_train_scaled, y_train, epochs=500,
                    batch_size=16, verbose=1, validation_split=0.2,
                    callbacks=[es])

Epoch 1/500
...
Epoch 43/500
Epoch 00043: early stopping
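
The patience logic can be sketched in a few lines of plain Python. This is a simplified stand-in for the callback (it ignores options such as `min_delta` and `restore_best_weights`), and the loss curve is hypothetical: it improves for three epochs and then plateaus.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# hypothetical validation losses: best value at epoch 3, no improvement afterwards
val_losses = [5.0, 4.0, 3.0] + [3.5] * 100

print(early_stopping_epoch(val_losses, patience=40))  # 43
```

Under these assumptions, stopping at epoch 43 with `patience=40` means the last improvement happened at epoch 3, which is consistent with the log above.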


<a name='layers'></a>
## 2.4 Changing the Model

The rule of thumb is to __start with one hidden layer and add as many units as we have features in the dataset__. However, this might not always work. We need to try things out and __observe our learning curve__.

<a name="auto"></a>
# 3. Automated Hyperparameter Tuning

<a name='gs'></a>
## 3.1 GridSearch

Grid search, or exhaustive search, tries __every combination of desired hyperparameter values__. This quickly becomes __computationally demanding__ as we increase the number of values per hyperparameter or the number of hyperparameters we want to tune.
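
How quickly the cost grows can be sketched with the standard library, using the same batch sizes and epoch counts as the grid in this section. The fold count assumes `GridSearchCV`'s default of 5-fold cross-validation.

```python
from itertools import product

param_grid = {"batch_size": [10, 40], "epochs": [10, 50]}

# every combination of the listed values
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
for combo in combinations:
    print(combo)

cv_folds = 5  # GridSearchCV's default number of folds
print(f"{len(combinations)} combinations x {cv_folds} folds = {len(combinations) * cv_folds} model fits")
```

Adding a third hyperparameter with three values would triple the number of fits, on top of one final refit on the full training set (`refit=True` is the default).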

To use `GridSearchCV` from `scikit-learn` for regression we need to first __wrap our NN model__ into a `KerasRegressor`.

In [33]:
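from tensorflow.keras import Input

def design_model():
    model = Sequential(name='my_model')
    # input layer sized to the number of features after one-hot encoding
    model.add(Input(shape=(X.shape[1],)))
    # hidden layer: one unit per feature (11), following the rule of thumb
    model.add(Dense(11, activation='relu'))
    # output layer: a single unit for the regression target
    model.add(Dense(1))
    model.compile(optimizer=Adam(learning_rate=0.01), loss='mse', metrics=['mae'])
    return model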

In [45]:
from scikeras.wrappers import KerasRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer
from tensorflow.keras.layers import Dense, InputLayer
from tensorflow.keras import Input

# wrap model
model = KerasRegressor(model=design_model)

# define hyperparameter values
bs = [10, 40]
ep = [10, 50]
param_grid = dict(batch_size=bs, epochs=ep)

# instantiate GS
gs = GridSearchCV(estimator = model,
                  param_grid=param_grid,
                  scoring=make_scorer(mean_squared_error, greater_is_better=False))

# extract results
gs_res = gs.fit(X_train_scaled, y_train, verbose=0)



In [48]:
gs.best_estimator_

KerasRegressor(
	model=<function design_model at 0x000001872AB67280>
	build_fn=None
	warm_start=False
	random_state=None
	optimizer=rmsprop
	loss=None
	metrics=None
	batch_size=10
	validation_batch_size=None
	verbose=1
	callbacks=None
	validation_split=0.0
	shuffle=True
	run_eagerly=False
	epochs=50
)

<a name='rs'></a>
## 3.2 RandomSearch

Random search tries __random combinations of hyperparameters__ rather than all of them. Thus, we change our hyperparameter specification for the randomized search to cover more options, here by sampling from distributions instead of listing a few fixed values.
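
The difference from a grid can be sketched with the standard library: instead of enumerating fixed values, each candidate is drawn at random from an integer range. The ranges mirror `randint(2, 16)` and `randint(10, 100)` from SciPy, whose upper bounds are exclusive. The helper `sample_candidates` and the seed are illustrative assumptions.

```python
import random

rng = random.Random(42)  # fixed seed so the draw is repeatable

def sample_candidates(n_iter):
    """Draw n_iter random (batch_size, epochs) pairs from integer ranges."""
    return [
        {"batch_size": rng.randrange(2, 16),  # 2..15, upper bound exclusive
         "epochs": rng.randrange(10, 100)}    # 10..99, upper bound exclusive
        for _ in range(n_iter)
    ]

candidates = sample_candidates(n_iter=12)
for candidate in candidates:
    print(candidate)
```

The number of candidates is fixed by `n_iter`, so the cost stays the same no matter how wide the ranges are.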

In [57]:
# check RS parameters (requires `rs`, which is defined in the next cell)
rs.get_params().keys()

dict_keys(['cv', 'error_score', 'estimator__model', 'estimator__build_fn', 'estimator__warm_start', 'estimator__random_state', 'estimator__optimizer', 'estimator__loss', 'estimator__metrics', 'estimator__batch_size', 'estimator__validation_batch_size', 'estimator__verbose', 'estimator__callbacks', 'estimator__validation_split', 'estimator__shuffle', 'estimator__run_eagerly', 'estimator__epochs', 'estimator', 'n_iter', 'n_jobs', 'param_distributions', 'pre_dispatch', 'random_state', 'refit', 'return_train_score', 'scoring', 'verbose'])

In [58]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

# define hyperparameter values
param_grid = {'batch_size': sp_randint(2, 16), 'epochs': sp_randint(10, 100)}

# wrap model
model = KerasRegressor(model=design_model)

# instantiate RS
rs = RandomizedSearchCV(estimator=model,
                        param_distributions=param_grid,
                        scoring=make_scorer(mean_squared_error, greater_is_better=False),
                        n_iter=12)

rs.fit(X_train_scaled, y_train, verbose=0)



RandomizedSearchCV(estimator=KerasRegressor(model=<function design_model at 0x000001872AB67280>),
                   n_iter=12,
                   param_distributions={'batch_size': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001872F1ECC40>,
                                        'epochs': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001872DFA8B80>},
                   scoring=make_scorer(mean_squared_error, greater_is_better=False))

In [59]:
rs.best_estimator_

KerasRegressor(
	model=<function design_model at 0x000001872AB67280>
	build_fn=None
	warm_start=False
	random_state=None
	optimizer=rmsprop
	loss=None
	metrics=None
	batch_size=3
	validation_batch_size=None
	verbose=1
	callbacks=None
	validation_split=0.0
	shuffle=True
	run_eagerly=False
	epochs=59
)

<a name='reg'></a>
## 3.3 Regularization: dropout

Regularization is a set of techniques that __prevent the learning process from completely fitting the model to the training data__, which can lead to overfitting. It makes the model simpler, smooths out the learning curve, and hence makes it more ‘regular’.

There are many techniques for regularization such as __simplifying the model__, adding __weight regularization__, __weight decay__, and so on. One of the most common regularization methods for neural networks is `Dropout`.

`Dropout` is a technique that __randomly ignores a number of outputs of a layer by setting them to zero__ during training. The __dropout rate__ is the fraction of layer outputs set to zero (usually between 20% and 50%).
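
A minimal NumPy sketch of the idea, assuming the "inverted dropout" formulation in which the surviving outputs are scaled up by `1 / (1 - rate)` during training so that their expected sum is unchanged. The function below is a hypothetical illustration, not the Keras layer itself.

```python
import numpy as np

def dropout(outputs, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of outputs and rescale the rest."""
    if not training:
        return outputs  # dropout is inactive at inference time
    keep_mask = rng.random(outputs.shape) >= rate
    return outputs * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
layer_outputs = np.ones(10)

dropped = dropout(layer_outputs, rate=0.3, rng=rng)
print(dropped)  # a mix of zeros and values of 1 / 0.7

unchanged = dropout(layer_outputs, rate=0.3, rng=rng, training=False)
print(unchanged)  # all ones
```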

In Keras, we can apply dropout by adding a `Dropout` layer after the layer whose outputs should be dropped.

In [62]:
from tensorflow.keras import layers

def design_model_dropout():
    model = Sequential(name='my_first_model')
    input = Input(shape=(X.shape[1],))
    model.add(input)
    model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dropout(0.1))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(32, activation='relu'))
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(1))
    model.compile(loss='mse', metrics=['mae'], optimizer=Adam(learning_rate=0.001))
    return model

In [63]:
# wrap model
model = KerasRegressor(model=design_model_dropout)

# define hyperparameter values
bs = [10, 40]
ep = [10, 50]
param_grid = dict(batch_size=bs, epochs=ep)

# instantiate GS
gs = GridSearchCV(estimator = model,
                  param_grid=param_grid,
                  scoring=make_scorer(mean_squared_error, greater_is_better=False))

# extract results
gs_res = gs.fit(X_train_scaled, y_train, verbose=0)



<a name='baselines'></a>
## 3.4 Baselines

A __baseline result__ is the __simplest possible prediction__ (__null accuracy__). 

For some problems, this may be a random result, and for others, it may be the most common class prediction. Since we are focused on a regression task, we can use the mean or median of the target distribution, known as __central tendency measures__, as the result for all predictions.

Scikit-learn provides `DummyRegressor`, which serves as a __baseline regression algorithm__. We’ll choose `mean` as our central tendency measure.

In [70]:
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# instantiate DR
dr = DummyRegressor(strategy='mean')

# fit DR
dr.fit(X_train_scaled, y_train)

# predict on test data (DummyRegressor ignores the features; the scaled set is passed for consistency)
y_pred = dr.predict(X_test_scaled)

# check baseline
MAE_baseline = mean_absolute_error(y_test, y_pred)

MAE_baseline

9190.331083088173

The baseline MAE is \$9190, while our previous model had a `val_mae` of \$2866, so the neural network performed much better than the baseline.
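
What `DummyRegressor(strategy='mean')` computes can be reproduced by hand. The sketch below uses made-up target values, not the insurance data: "fitting" stores the mean of the training targets, and "predicting" returns that constant for every sample.

```python
from statistics import mean

# made-up charges, for illustration only
y_train_toy = [10.0, 20.0, 30.0, 40.0]
y_test_toy = [15.0, 35.0]

# "fit": remember the mean of the training targets
baseline_prediction = mean(y_train_toy)

# "predict": return that constant for every test sample
y_pred_toy = [baseline_prediction] * len(y_test_toy)

# mean absolute error of the constant prediction
mae = mean(abs(t - p) for t, p in zip(y_test_toy, y_pred_toy))
print(baseline_prediction, mae)  # 25.0 10.0
```

Any model worth keeping should beat this constant prediction by a clear margin on the same metric.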