In [None]:
from tensorflow import keras
import pandas as pd
import seaborn as sns

# Deep Learning for tabular data

## Summary of Deep Learning Methods

Today we will cover the most fundamental application of deep learning:
- **Feedforward** Neural Networks maximize flexibility. They are appropriate in cases where we shouldn't make assumptions about relationships between our input features. They do especially well on tabular data, like the dataframes we've seen so far. 

Other important techniques of deep learning:
- **Embeddings** are a technique for learning efficient relationships between categories.
- **Convolutional** Neural Networks are used to model spatial relationships. They are particularly useful in image and audio tasks but have many more applications.
- **Recurrent** Neural Networks are used to model sequences of data, like sentences or time series.

## Purpose of this notebook

- Demonstrate best practices for deep learning on tabular data.
- Discuss common techniques for applying and improving deep learning models

## Dataset

We will load the Ames Housing data, split it into train and test sets, and build some models. (www.amstat.org/publications/jse/v19n3/decock.pdf). 

In the following cells I will perform the basic data cleaning steps.

In [None]:
# I'm going to load in the data and take care of the data cleaning here.

ames_df=pd.read_csv("http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt", sep='\t')



y_orig = ames_df["SalePrice"]
x_orig = ames_df.drop(columns="SalePrice")

ames_dummies = pd.get_dummies(x_orig, dummy_na=True)
ames_dummies = ames_dummies.fillna(0)

x_orig = x_orig.select_dtypes('number')
x_orig = x_orig.fillna(0)


x_orig = x_orig.merge(ames_dummies, left_index=True, right_index=True)
x_orig = x_orig.astype(float)

<mark>**Important**: we previously said that neural nets require minimal pre-processing. One of those processing steps is scaling our input data. It's recommended that you scale (normalize) all input data when using deep learning/neural nets.</mark>

In [None]:
from sklearn import preprocessing, model_selection


x, x_test, y, y_test = model_selection.train_test_split(x_orig,y_orig)

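scaler = preprocessing.StandardScaler()
# Fit the scaler on the training data only...
x = scaler.fit_transform(x)
# ...then reuse the training statistics on the test data (no re-fitting).
x_test = scaler.transform(x_test)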
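
To make the scaling step concrete, here is a minimal NumPy sketch of standardization: subtract the per-feature mean and divide by the per-feature standard deviation. The toy arrays are made-up values for illustration only. The point is that the statistics are learned from the training split and then reused unchanged on the test split.

```python
import numpy as np

# Toy data: two features on very different scales (hypothetical values).
train = np.array([[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]])
test = np.array([[2500.0, 2.0]])

# Learn the scaling statistics from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# Apply the same statistics to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately zero for each feature
print(train_scaled.std(axis=0))   # approximately one for each feature
print(test_scaled)
```

After scaling, both features live on a comparable scale, so neither dominates the gradient updates simply because of its units.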

We'll define a helper that scores a model on the test set so we can compare our models.

In [None]:


from sklearn import metrics

def benchmark(model):
    y_pred = model.predict(x_test)

    print(f"mae: {metrics.mean_absolute_error(y_test, y_pred):,.2f}")
    print(f"mse: {metrics.mean_squared_error(y_test, y_pred):,.2f}")


## Baseline model

When we care about performance, random forests are a great off-the-shelf model for tabular data.

In [None]:
from sklearn import ensemble


rf = ensemble.RandomForestRegressor(n_estimators=100)

rf.fit(x,y)

In [None]:
benchmark(rf)

We'll use this RF as our baseline. Let's train a plain neural net to do the same thing.

## Simple neural net


Let's build a two-layer network just as we did in the Neural Net Theory notebook. The difference here is that we will not use an activation function on the output.


<mark>For regression problems, you will typically use a linear (no) activation function on your final layer.</mark>

In [None]:
ff_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=x.shape[1:]),
    keras.layers.Dense(units=5, activation="relu"),
    keras.layers.Dense(units=1),
])
ff_model.compile("sgd", loss="mean_absolute_error", metrics=["mean_squared_error"])
ff_model.summary()

In [None]:
ff_model.fit(x, y, epochs=20)

In [None]:
benchmark(ff_model)

![](https://www.dropbox.com/s/46x057it18kuhh3/2019-03-01_09-16-00.png?raw=1)
Here's a look at some new things used in the above cell


## Check for understanding

**Question**: the output layer of the model above has 6 trainable parameters. Where does that number come from?

Each dense layer applies a linear transformation, optionally followed by an activation function $f_a$:

$$f_a(\mathbf{w}^\top\mathbf{x} + \mathbf{b})$$

We know that the input to the second layer has a dimensionality of $[5 \times n]$ where $n$ is the number of examples in our minibatch. We also know that the output of the second layer is $[1 \times n]$. These both come from setting `units=5` in the first layer and `units=1` in the second.

What do $\mathbf{w}$ and $\mathbf{b}$ need to look like in order to scale the units from 5 to 1? $\mathbf{w}$ will need to be $[5 \times 1]$ and $\mathbf{b}$ will need to be $[1 \times 1]$. That leaves 5 trainable parameters in $\mathbf{w}$ and just one in $\mathbf{b}$.
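
The same counting argument works for any dense layer: one weight per (input, unit) pair plus one bias per unit. Here is a small sketch of that arithmetic. The helper name `dense_param_count` is made up for illustration, and the input width of 387 is the figure quoted later in this notebook.

```python
def dense_param_count(n_inputs: int, units: int) -> int:
    """Trainable parameters in a dense layer: weights plus biases."""
    weights = n_inputs * units  # one weight per (input, unit) pair
    biases = units              # one bias per unit
    return weights + biases


# Output layer of the simple model: 5 inputs -> 1 unit.
print(dense_param_count(5, 1))    # 6

# First layer of the simple model, assuming 387 input features -> 5 units.
print(dense_param_count(387, 5))  # 1940
```

You can check numbers like these against the `Param #` column printed by `model.summary()`.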

# Helpful tools 🛠

Okay, so that's our basic, vanilla neural network. It's a good starting point. Let's look at a few additions that might help us.

## Dropout

```python
keras.layers.Dropout(0.05)
```

Dropout is a layer that randomly selects a fraction of its inputs (the activations coming out of the previous layer) and sets them to zero, scaling the remaining values up to compensate. The fraction above is 5%. Dropout only applies in the training phase of the model and is turned off when we use `model.predict(...)`.
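
To see what happens inside the layer, here is a rough NumPy sketch of the "inverted dropout" idea: zero a random fraction of the values passing through, then scale the survivors by `1 / (1 - rate)` so the expected total stays the same. This is an illustration under those assumptions, not the Keras implementation. The function name and the toy input are made up.

```python
import numpy as np


def dropout(activations, rate, training, rng):
    """Illustrative inverted dropout; not the Keras source code."""
    if not training or rate == 0:
        # At prediction time the layer passes values through untouched.
        return activations
    # Keep each value with probability (1 - rate).
    keep_mask = rng.random(activations.shape) >= rate
    # Rescale the kept values so the expected sum is unchanged.
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
a = np.ones((4, 5))

train_out = dropout(a, rate=0.5, training=True, rng=rng)
test_out = dropout(a, rate=0.5, training=False, rng=rng)

print(train_out)  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
print(test_out)   # identical to the input
```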

Dropout is part of a broad view of regularization that has become more common with Deep Learning techniques. Before, you might have defined regularization as something like "adding a penalty on the weights to your loss function". From now on, we'll use regularization to refer to any technique that penalizes training performance in order to improve test performance.


> <mark>"Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error."</mark> [Deep Learning](https://www.deeplearningbook.org/) by Goodfellow, Bengio and Courville

## Weight regularization

```python
keras.layers.Dense(units=300, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.01)),
```

You can still use regularization on the weights of your neural network with the `kernel_regularizer=` parameter. This works just like in ridge/lasso regression that you learned about before. Regularization can also be applied to the bias (less common) and the activation for each layer.
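
As a sketch of what the penalty adds to the loss, the L2 term is the multiplier times the sum of squared weights, and it is added on top of the task loss. The weight matrix and the task loss below are made-up numbers for illustration.

```python
import numpy as np


def l2_penalty(weights, multiplier):
    """L2 penalty: multiplier * sum of squared weights."""
    return multiplier * np.sum(np.square(weights))


w = np.array([[0.5, -1.0], [2.0, 0.0]])  # hypothetical layer weights
task_loss = 3.0                           # hypothetical regression loss

penalty = l2_penalty(w, 0.01)
total_loss = task_loss + penalty

print(penalty)     # 0.01 * (0.25 + 1.0 + 4.0) = 0.0525
print(total_loss)  # 3.0525
```

Because the optimizer minimizes the total, larger multipliers push the weights toward zero more strongly.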

## Advanced optimizers

```python
test_model.compile(
    keras.optimizers.Adam(lr=0.001), loss="mean_absolute_error",
    metrics=["mean_squared_error"])
```

We previously used plain stochastic gradient descent (`"sgd"`), but we have a lot of options. You can use any of the optimizers from the keras package and set their parameters yourself. 

[See the list of keras optimizers.](https://keras.io/optimizers/)

**Setting the learning rate**: If you set `lr` too high, the model won't learn because each update overshoots the minimum. Too low and it won't learn because it isn't moving far enough. The default values are often good places to start. (Newer Keras releases name this argument `learning_rate` instead of `lr`.) <mark>Tip: it often works best to set the learning rate to the highest rate where the model still improves with each epoch.</mark>

**Good default**: a good default is to start with `keras.optimizers.Adam`. This is a strong-performing advanced optimizer that incorporates momentum and adapts the learning rate for each parameter.  
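
To build intuition for the learning rate, here is a toy sketch of plain gradient descent on the one-dimensional loss $w^2$ (gradient $2w$). It is not Adam and not a neural network, just an illustration of the "too high / too low" behaviour. The three rates are arbitrary choices for the demo.

```python
def gradient_descent(lr, start=1.0, steps=20):
    """Minimize f(w) = w**2 with a fixed learning rate."""
    w = start
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # standard gradient descent update
    return w


for lr in (1.1, 0.4, 0.001):
    print(f"lr={lr}: |w| after 20 steps = {abs(gradient_descent(lr)):.6g}")
```

With `lr=1.1` every step overshoots and the distance from the minimum grows; with `lr=0.001` the weight barely moves from its starting point; `lr=0.4` lands very close to the minimum at zero.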

## Callbacks

Callbacks are functions that Keras runs at set points during training, such as the end of each batch or epoch. Keras ships with several built-in callbacks, and they are very, very helpful. 

[See the list of keras callbacks.](https://keras.io/callbacks/)

```python
test_model.fit(
    x, y, epochs=100, batch_size=100, validation_split=.25, verbose=1,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=8, verbose=1, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(factor=.5, patience=3, verbose=1),
    ])
```

### `ReduceLROnPlateau`

![](https://www.dropbox.com/s/bplh2nbnyaus682/2019-03-01_10-26-06.png?raw=1)

`ReduceLROnPlateau` reduces the learning rate after the model stops improving. The `patience` parameter sets how many epochs without improvement it waits before acting, and `factor` sets what the learning rate is multiplied by each time.
 
Why would you want to do this? The learning rate balances how quickly the model descends the loss gradient against how precisely it does so. A high learning rate moves more quickly but cannot precisely find minima. A low learning rate can carefully settle into a minimum, but it does so very slowly. In practice, modelers have found that learning rate schedules let a model quickly approach minima at the beginning of training and then target them more precisely later on to improve performance. In the figure here (reproduced from [Clevert, Unterthiner & Hochreiter](https://arxiv.org/abs/1511.07289)) you can see a common pattern where learning plateaus at one learning rate and then accelerates again as soon as the learning rate is lowered.

`ReduceLROnPlateau` is convenient because it allows you to perform this automatically without designing a learning rate schedule by hand. 
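
The bookkeeping behind this callback can be sketched in a few lines of plain Python. This is a simplified illustration of the idea (it ignores options such as a minimum improvement threshold or a cooldown period), not the Keras source. The validation losses are made up.

```python
def reduce_lr_on_plateau(val_losses, lr=0.01, factor=0.5, patience=3):
    """Return the learning rate in effect after each epoch (simplified sketch)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau: shrink the learning rate and start counting again.
                lr *= factor
                wait = 0
        history.append(lr)
    return history


losses = [10, 8, 7, 7, 7, 7, 6, 6, 6, 6]  # hypothetical validation losses
print(reduce_lr_on_plateau(losses))
```

In this toy run the rate is halved after three epochs stuck at 7, and halved again after three epochs stuck at 6.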

### `EarlyStopping`

How many epochs should you tell your model to use? There's no good answer to this question, but thankfully we can use `EarlyStopping` to tell the model to stop whenever performance is no longer improving. <mark>Set `patience` here to be higher than the `patience` of `ReduceLROnPlateau` if you're using both.</mark> That way the learning rate gets a chance to drop before training is cut off. With `restore_best_weights=True` the model will restore the weights of the best epoch after it's finished. 
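
The stopping rule itself is simple enough to sketch in plain Python. Again, this is a simplified illustration rather than the Keras implementation, and the validation losses are invented for the example.

```python
def early_stopping(val_losses, patience=8):
    """Return (epoch training stops at, epoch with the best loss)."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs in a row: stop here.
                return epoch, best_epoch
    # Ran out of epochs without triggering the rule.
    return len(val_losses) - 1, best_epoch


losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.9]  # hypothetical validation losses
stopped_at, best_epoch = early_stopping(losses, patience=3)
print(stopped_at, best_epoch)  # 5 2
```

Training stops at epoch 5, but the best weights were seen at epoch 2, which is why restoring the best weights matters.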

In [None]:
test_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=x.shape[1:]),
    keras.layers.Dense(units=300, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.01),
    keras.layers.Dense(units=200, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dense(units=100, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dense(units=50, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dense(units=1),
])
test_model.compile(
    keras.optimizers.Adam(lr=0.001), loss="mean_absolute_error",
    metrics=["mean_squared_error"])
test_model.summary()

In [None]:
test_model.fit(
    x, y, epochs=100, validation_split=.25, verbose=1,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=8, verbose=1, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(factor=.5, patience=3, verbose=1),
    ])

In [None]:
benchmark(test_model)

# In-class exercise (you code)

The model above is overfitting. How can you tell?


In the cell below, build a similar model but address the problem of overfitting. See if this improves performance on the test set.

You should name your model `student_model`. Feel free to start with the boilerplate below.

```python
student_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=x.shape[1:]),
    ...
])
student_model.compile(
    keras.optimizers.Adam(lr=0.001), loss="mean_absolute_error",
    metrics=["mean_squared_error"])
student_model.summary()
```

<small>[Note from Sophie Searcy]</small>  
I often find that L2 regularization is effective for tabular regression tasks. The trick is to make sure you are scaling your L2 multiplier correctly. You want to set the multiplier so that your model pays attention to both your regression loss and regularization loss. 

Also note that when we use regularization, we can no longer assume that the model loss (which is your total loss for your model) and regression loss (the loss for your regression task) are the same. In the case below, the loss reported after each epoch is now `mean_absolute_error+regularization`. To get a measure of the regression loss, I add `mean_absolute_error` back into the metrics. This way I know what each component of the loss is after each epoch. 

The last epoch for the solution below should log something like `loss: 20266.6511 - mean_absolute_error: 8115.6820`. With that information I know that my regularization loss is `20266.6511 - 8115.6820 = 12150.9691`. Even though the regularization loss is most of the total loss at the end, the model is still overfitting! That's okay, because we are still getting validation and test performance that's beating our random forest above 😉.

Also note that we are using both dropout and L2 regularization. I find that the combination tends to work better than using one of them alone, but your mileage may vary!

In [None]:
### BEGIN SOLUTION

 

L2 = 50
DROP = 0.05
student_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=x.shape[1:]),
    keras.layers.Dense(units=140, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(L2)),
    keras.layers.Dropout(DROP),
    keras.layers.Dense(units=120, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(L2)),
    keras.layers.Dropout(DROP),
    keras.layers.Dense(units=100, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(L2)),
    keras.layers.Dropout(DROP),
    keras.layers.Dense(units=80, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(L2)),
    keras.layers.Dropout(DROP),
    keras.layers.Dense(units=60, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(L2)),
    keras.layers.Dropout(.5*DROP),
    keras.layers.Dense(units=40, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(L2)),
    keras.layers.Dropout(.5*DROP),
    keras.layers.Dense(units=20, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(L2)),
#     keras.layers.Dropout(DROP),
    keras.layers.Dense(units=1),
])
student_model.compile(
    keras.optimizers.Adam(lr=0.01), loss="mean_absolute_error",
    metrics=["mean_absolute_error", "mean_squared_error"])
student_model.summary()

### END SOLUTION

In [None]:
student_model.fit(
    x, y, epochs=500, validation_split=.25, verbose=1, callbacks=[
        keras.callbacks.EarlyStopping(
            patience=16,
            verbose=1,
        ),
        keras.callbacks.ReduceLROnPlateau(
            factor=.5,
            patience=5,
            verbose=1,
        ),
    ])

In [None]:
benchmark(student_model)

# Question: addressing overfitting in deep learning

Given the tools you've seen so far, if you have a deep learning model that's overfitting, how can you address this?

# Answer: addressing overfitting in deep learning

- Reduce parameters by reducing the number of units in each layer or reducing the number of layers.
- Add dropout layers (or increase the dropout rate)
- Add regularization (or increase the strength)
- There are others that we haven't discussed yet!

# Aside: How do you choose the size of each layer?

The model in the above solution has the following layer sizes.

- (input) 387
- 140
- 120
- 100
- 80
- 60
- 40
- 20
- 1 (output layer)

How should you choose these sizes when designing your own model?

First, your input layer size is fixed by your input. This is also true of your output layer size, which is fixed by the task. Here are two examples of common tasks:
- Regression on a single variable: `units=1`
- Categorization with multiple categories: `units=len(unique(y))`

That gives us our start and end point. Here are two rules of thumb for the hidden layers, reading from the output back toward the input:

**Linear increase**: increase the size of each layer by a fixed amount (like the solution model above).

**Geometric increase**: increase the size of each layer by a fixed multiplier (like the list below).

- (input) 387
- 64
- 32
- 16
- 8
- 4
- 2
- 1 (output layer)
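
Both patterns are easy to generate programmatically. The two helper functions below are made up for illustration. They list the hidden layer sizes from the first hidden layer down to the last one, leaving out the input and output layers.

```python
def linear_sizes(step, depth):
    """Hidden layer sizes that shrink by a fixed amount per layer."""
    return [step * k for k in range(depth, 0, -1)]


def geometric_sizes(multiplier, depth):
    """Hidden layer sizes that shrink by a fixed factor per layer."""
    return [multiplier ** k for k in range(depth, 0, -1)]


print(linear_sizes(step=20, depth=7))          # [140, 120, 100, 80, 60, 40, 20]
print(geometric_sizes(multiplier=2, depth=6))  # [64, 32, 16, 8, 4, 2]
```

The first call reproduces the hidden layers of the solution model and the second reproduces the geometric list.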

*How do you choose between the two?* I would consider them a generic hyperparameter that can be tweaked. If you're trying to squeeze every ounce of performance out of a model, compare the cross-validated performance of each. Otherwise, just pick the one you're most comfortable with and stick with it.

*How do you choose the size of increase?* This should be considered a hyperparameter that affects model complexity. With a bigger step size you'll have more parameters and a more complex model that is more likely to overfit. You can use this to tune your model.  
<small>(I like to limit the step size so that the first layer is smaller than the input size, mostly for aesthetic reasons.)</small>

# Concluding

Okay, that's a lot of information and a lot of (hyper)parameters to worry about! Technically, because we can always add more layers, there is an infinite number of hyperparameters in every model. That's kind of daunting! 🙅‍ Here's some final advice on how to deal with this problem and a little demo of how to use sklearn's `GridSearchCV`.

## Advice
- <mark>Start with an existing model that works</mark>. If there's a working model in a paper, blog-post, Metis lesson notebook, by all means *start* there. Then, as you're making this model your own, change it one step at a time and make sure each step doesn't break things.
- Look for ways to <mark>reduce the things you need to worry about</mark>. Examples: I almost always leave `batch_size` at the default because it (typically) doesn't meaningfully affect training. I also use `EarlyStopping` in every model so I don't have to set the number of epochs.
- <mark>Every change that you make should be for a reason</mark>. It's easy to get lost in the infinite tiny decisions in every deep learning model. Being strategic about the changes that you're making will help you navigate this problem. Examples:
    - I want to give my model room to be more complex *because I think it's underfit*. I'll add layers and/or increase the size of existing layers. I might also reduce my regularization settings if I am already using them.
    - My model performance is not reliably improving between epochs. I want to test out different optimizers and learning rates *because I think there's an optimization problem*.
    - I want to use regularization on my model *because I think it's overfit*. I'll add dropout or regularization or increase the strength of those layers.


## `GridSearchCV` demo

Even taking the advice above, we still have a bunch of decisions to make. Here's a demo using sklearn's grid search to tackle a bunch of hyperparameters. We just set up a function to build a model based on the hyperparameters and then let it go to town 💁‍

In [None]:
def make_model(base=20, add=0, noise=.0, drop=0, depth=10, batchnorm=False,
               act_reg=0.01, kern_reg=0.01, lr=0.01):

    layers = [keras.layers.InputLayer(input_shape=x.shape[1:])]

    if noise > 0:
        layers.append(keras.layers.GaussianNoise(noise))

    for mult in range(depth, 0, -1):
        layers.append(
            keras.layers.Dense(
                units=mult * base + add, activation="relu",
                kernel_regularizer=keras.regularizers.l2(kern_reg),
                activity_regularizer=keras.regularizers.l2(act_reg)),
        )
        if batchnorm and (mult % batchnorm == 0):
            layers.append(keras.layers.BatchNormalization())
        # No dropout after the last hidden layer (mult == 1), half-strength
        # dropout after the second-to-last (mult == 2), full strength otherwise.
        if drop > 0 and mult > 1:
            rate = .5 * drop if mult == 2 else drop
            layers.append(keras.layers.Dropout(rate))

    layers.append(keras.layers.Dense(1))

    model = keras.Sequential(layers)

    model.compile(keras.optimizers.Adam(lr=lr), loss="mae", metrics=["mae", "mse"])

    return model

In [None]:
test_model = make_model(base=10, add=10, depth=5, drop=0.01, batchnorm=False, act_reg=0, kern_reg=50)
test_model.summary()
test_model.fit(
    x, y, epochs=200, validation_split=.25, verbose=1, callbacks=[
        keras.callbacks.EarlyStopping(
            patience=8,
            verbose=1,
        ),
        keras.callbacks.ReduceLROnPlateau(
            factor=.2,
            patience=3,
            verbose=1,
        ),
    ])

In [None]:
benchmark(test_model)

In [None]:
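# Use the wrapper that ships with tensorflow.keras so it matches the `keras`
# imported at the top. Note: this module is deprecated and has been removed from
# recent TensorFlow releases; SciKeras provides a maintained `KerasRegressor`.
from tensorflow.keras.wrappers import scikit_learn as k_sklearn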
from sklearn import model_selection

keras_model = k_sklearn.KerasRegressor(make_model)

validator = model_selection.GridSearchCV(
    keras_model, param_grid={
        'base': [10, 20], 
        'noise': [0],
        'depth': [5, 10],
        'drop': [0, 0.01],
        'act_reg':[0], 
        'kern_reg':[0,10],
        'batchnorm': [False],
    }, scoring='neg_mean_absolute_error', n_jobs=-1, cv=3, verbose=2)

# Uncomment when you're ready to run. This one will take a while

# validator.fit(
#     x, y, epochs=200, validation_split=.25, verbose=0, callbacks=[
#         keras.callbacks.EarlyStopping(
#             patience=8,
#             verbose=0,
#         ),
#         keras.callbacks.ReduceLROnPlateau(
#             factor=.2,
#             patience=3,
#             verbose=0,
#         ),
#     ])

In [None]:
# y_pred = validator.predict(x_test)
# print(f"mae: {metrics.mean_absolute_error(y_test, y_pred):,.2f}")
# print(f"mse: {metrics.mean_squared_error(y_test, y_pred):,.2f}")

In [None]:
# pd.DataFrame(validator.cv_results_)
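
Before uncommenting the search, it helps to estimate how much work it is. The sketch below copies the `param_grid` and `cv=3` from the cell above and counts the combinations using only the standard library. It is back-of-the-envelope arithmetic, not something `GridSearchCV` requires you to do.

```python
from itertools import product

# Same grid as the GridSearchCV cell above.
param_grid = {
    "base": [10, 20],
    "noise": [0],
    "depth": [5, 10],
    "drop": [0, 0.01],
    "act_reg": [0],
    "kern_reg": [0, 10],
    "batchnorm": [False],
}
cv = 3

# Every combination of one value per hyperparameter.
combos = list(product(*param_grid.values()))
n_fits = len(combos) * cv

print(len(combos))  # 16 combinations
print(n_fits)       # 48 cross-validation fits
```

Each of those fits trains a full network, and by default `GridSearchCV` refits the best configuration once more on all of the training data, so adding even one more value to a hyperparameter list has a noticeable cost.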