# Continuing with Keras intro

## Objectives
- Continue the basics of neural networks with Keras
- Build and train a simple regression model
- Explore some more features of the Keras/TensorFlow ecosystem

First we'll load the necessary libraries and make sure we have the right versions.

In [None]:
import sys
assert sys.version_info >= (3, 7)

from packaging import version
import sklearn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

import numpy as np
import tensorflow as tf

assert version.parse(tf.__version__) >= version.parse("2.8.0")

import matplotlib.pyplot as plt
import pandas as pd

## Predicting house prices
And you thought you were done with the California housing dataset!

In [None]:
# extra code – load and split the California housing dataset, like earlier
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target, random_state=42)
# split it again to get a validation set
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

print("Training instances: ", y_train.shape)
print("Validation instances: ", y_valid.shape)
print("Testing instances: ", y_test.shape)

This time, let's use the approach of a constant number of neurons per layer and see how things go.

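The [Adam optimizer](https://arxiv.org/abs/1412.6980) is a very popular adaptive learning rate method that takes into account both the first and second moments (running mean and uncentered variance) of the gradients. Each step is scaled by the ratio of the first moment to the square root of the second moment, so the step size is normalized by the recent gradient magnitude: noisy gradients produce smaller steps, while small but consistent gradients still make steady progress.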
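
To make the role of the two moments concrete, here is a minimal pure-Python sketch of a single Adam update for one scalar parameter. The function name `adam_step` and the toy objective are made up for illustration; the default values for `beta1`, `beta2` and `eps` follow the ones suggested in the paper, and this is not how Keras implements the optimizer internally.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem (an assumption for this sketch): minimize f(w) = (w - 3)^2 from w = 0
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
print(w)  # should end up close to 3
```

Note that on the very first step the update has magnitude close to `lr` whatever the size of the gradient, because the gradient is divided by its own magnitude.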

We'll also use a `Normalization` layer to scale the input features (the `StandardScaler` from scikit-learn would also work).
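
As a rough picture of what this scaling does, the sketch below standardizes each feature with statistics computed from the training rows only, then reuses those statistics on held-out rows. The helper names and the toy numbers are invented for illustration; the real `Normalization` layer and `StandardScaler` handle batching, dtypes and edge cases that this sketch ignores.

```python
import math

def fit_standardizer(rows):
    """Compute per-feature mean and (population) standard deviation from the training rows."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(n_features)]
    return means, stds

def standardize(rows, means, stds, eps=1e-7):
    """Apply (x - mean) / std feature by feature, using previously fitted statistics."""
    return [[(x - mu) / max(sd, eps) for x, mu, sd in zip(r, means, stds)]
            for r in rows]

# Toy data standing in for X_train / X_valid (values are made up)
toy_train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
toy_valid = [[2.0, 400.0]]

means, stds = fit_standardizer(toy_train)   # statistics come from the training data only
train_scaled = standardize(toy_train, means, stds)
valid_scaled = standardize(toy_valid, means, stds)
print(train_scaled)
print(valid_scaled)
```

After scaling, both training columns have mean 0 and unit standard deviation even though their raw ranges differ by a factor of 100.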

❓ **Discussion questions**: 
- What should the output layer look like in a regression problem?
- How should we pick the number of neurons and layers?

In [None]:
tf.random.set_seed(42)
norm_layer = tf.keras.layers.Normalization(input_shape=X_train.shape[1:])
model = tf.keras.Sequential([
    norm_layer,
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])

# the adapt method computes the mean and standard deviation of the input features
# Note that only the training data is used to compute the mean and standard deviation!
norm_layer.adapt(X_train)
history = model.fit(X_train, y_train, epochs=100, validation_batch_size=len(X_valid),
                    validation_data=(X_valid, y_valid))

In [None]:
# plot the training curves
pd.DataFrame(history.history).plot(
    figsize=(8, 5), grid=True, xlabel="Epoch",
    style=["r--", "r--.", "b-", "b-*"])

The validation behaviour looks awfully weird. What might be happening?

Let's look at the [`fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method to try to understand the training process (and hyperparameters) better.

❓ **Discussion questions**: 
- What is the `batch_size` parameter?
- How do you decide on the minibatch size?
- What are the defaults?
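
One way to think about the minibatch size is in terms of how many gradient updates happen per epoch. The sketch below assumes the Keras default of `batch_size=32` and a training set of 11,610 rows (what the two default 75/25 splits of the 20,640-row dataset should give); the helper name `steps_per_epoch` is made up for this example.

```python
import math

def steps_per_epoch(n_samples, batch_size=32):
    """Number of gradient updates in one pass over the data (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

n_train = 11610  # assumed training-set size after the two splits
for bs in (32, 256, n_train):
    print(bs, steps_per_epoch(n_train, bs))
# prints:
# 32 363
# 256 46
# 11610 1
```

Larger batches mean fewer (but less noisy) updates per epoch, so the number of epochs needed to reach the same loss usually changes with the batch size.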


In [None]:
mse_test, rmse_test = model.evaluate(X_test, y_test)

Hmm... is that a good RMSE value? Let's try a scatter plot to see how we did.

In [None]:
plt.scatter(y_test, model.predict(X_test), alpha=0.1)
plt.xlabel("Actual House price ($100,000)")
plt.ylabel("Predicted House price ($100,000)")
# Probably better than our random forest model, but still not great
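
Another way to judge an RMSE value is to compare it against a naive baseline that always predicts the mean training target. The sketch below uses made-up prices (in units of $100,000), not values from the dataset, and the helper `rmse` is defined here only for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up target values, for illustration only
toy_y_train = [1.0, 2.0, 3.0, 2.0]
toy_y_test = [1.5, 2.5, 4.0]

mean_price = sum(toy_y_train) / len(toy_y_train)          # the "model" is a single constant
baseline_rmse = rmse(toy_y_test, [mean_price] * len(toy_y_test))
print(baseline_rmse)
```

With the notebook's variables, the same baseline would be `np.sqrt(np.mean((y_test - y_train.mean()) ** 2))`; a useful model should score well below it.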

## Building Complex Models Using the Functional API

Not all neural network models are simply sequential. Some may have complex topologies. Some may have multiple inputs and/or multiple outputs. For example, a Wide & Deep neural network (see [paper](https://ai.google/research/pubs/pub45413)) connects all or part of the inputs directly to the output layer.

In [None]:
# extra code – reset the name counters and make the code reproducible
tf.keras.backend.clear_session()
tf.random.set_seed(42)

Here we'll build a new model using the functional API. We could build the exact same model as before, but the functional API adds flexibility for more complex models.

The layers are as follows:
- Input layer, same as before but we need to be more explicit about it (i.e. specify the shape)
- Normalization (same as before)
- 2x Dense layers with only 30 neurons each this time
- Concatenation of the normalized input and the output of the second Dense layer - also called a "skip connection"
- Our output layer, same as before
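
The layer list above can be sanity-checked with a little arithmetic before building anything. This sketch counts the trainable parameters each Dense layer should have, assuming the 8 input features of the California housing dataset; the helper `dense_params` is made up for illustration, and the Normalization layer's stored statistics are not counted because they are not trainable.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 8                                   # California housing has 8 input features
n_params_hidden1 = dense_params(n_features, 30)
n_params_hidden2 = dense_params(30, 30)
concat_width = n_features + 30                   # normalized inputs joined with the second layer's outputs
n_params_output = dense_params(concat_width, 1)

print(n_params_hidden1, n_params_hidden2, concat_width, n_params_output)
# prints: 270 930 38 39
print(n_params_hidden1 + n_params_hidden2 + n_params_output)
# prints: 1239
```

These numbers can be compared with the trainable parameter counts reported by `model.summary()`.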

In [None]:
normalization_layer = tf.keras.layers.Normalization()
hidden_layer1 = tf.keras.layers.Dense(30, activation="relu")
hidden_layer2 = tf.keras.layers.Dense(30, activation="relu")
concat_layer = tf.keras.layers.Concatenate()
output_layer = tf.keras.layers.Dense(1)

input_ = tf.keras.layers.Input(shape=X_train.shape[1:])
normalized = normalization_layer(input_)
hidden1 = hidden_layer1(normalized)
hidden2 = hidden_layer2(hidden1)
concat = concat_layer([normalized, hidden2])
output = output_layer(concat)

model = tf.keras.Model(inputs=[input_], outputs=[output])

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True)

In [None]:
model.summary()

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
normalization_layer.adapt(X_train)
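# note: batch_size=len(X_test) only borrows the test set's length to get a large minibatch
history = model.fit(X_train, y_train, epochs=20, batch_size=len(X_test),
                    validation_data=(X_valid, y_valid))
mse_test, rmse_test = model.evaluate(X_test, y_test)  # evaluate returns [loss, metric]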
y_pred = model.predict(X_test)

In [None]:
mse_test, rmse_test = model.evaluate(X_test, y_test)
# plot the training curves
pd.DataFrame(history.history).plot(
    figsize=(8, 5), grid=True, xlabel="Epoch",
    style=["r--", "r--.", "b-", "b-*"])

Hmm, the RMSE is actually a bit worse than before. However, the functional API allows for a lot more flexibility - the [original notebook](https://github.com/ageron/handson-ml3/blob/main/10_neural_nets_with_keras.ipynb) and the associated text in chapter 10 go into a lot more detail. You can do things like:
- Split the inputs so that some go through the "deep" layers and some go directly to the output
- Define multiple outputs - for example, predict both the house price and classify it as a "good deal" or not
- Add auxiliary outputs to do stuff with intermediate layers

Finally, you can also define a model by subclassing the `Model` class and defining your own `call` method to create a more dynamic model. This is also the PyTorch way of doing things.

## Saving and Restoring a Model
Ultimately, after spending all this time training a model, you'll probably want to save it (or just its weights) so you can use it later. You can also use a **callback** to save the model periodically during training, which protects you against a crash or timeout and lets you keep the best intermediate result.

In [None]:
from pathlib import Path
import shutil

# extra code – delete the directory, in case it already exists
shutil.rmtree("my_keras_model", ignore_errors=True)

In [None]:
model.save("my_keras_model", save_format="tf")

In [None]:
# extra code – show the contents of the my_keras_model/ directory
for path in sorted(Path("my_keras_model").glob("**/*")):
    print(path)

In [None]:
model = tf.keras.models.load_model("my_keras_model")

In [None]:
model.save_weights("my_weights")

In [None]:
model.load_weights("my_weights")

In [None]:
# extra code – show the list of my_weights.* files
for path in sorted(Path().glob("my_weights.*")):
    print(path)

## Using Callbacks
Here we'll define two simple callbacks (built into Keras): one for early stopping and one for saving the model weights at the end of each epoch. The early stopping callback will stop the training if the validation loss has not improved for a given number of epochs (the `patience`).
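
The early stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified illustration of the patience logic, not the actual Keras implementation: it treats any strict decrease as an improvement, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_index, best_index) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_index = -1
    wait = 0                                   # epochs since the last improvement
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index, wait = loss, i, 0
        else:
            wait += 1
            if wait >= patience:
                return i, best_index           # stop here; best weights are from best_index
    return len(val_losses) - 1, best_index     # ran out of epochs without triggering

toy_val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.59, 0.60]
print(early_stopping_epoch(toy_val_losses, patience=3))  # prints (5, 2)
print(early_stopping_epoch(toy_val_losses, patience=5))  # prints (7, 6)
```

With a patience of 3, training stops before the later improvement at index 6 is ever seen, which is the usual trade-off when choosing `patience`.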

We can also define custom callbacks - again the original notebook goes into a lot more detail.

In [None]:
shutil.rmtree("my_checkpoints", ignore_errors=True)  # extra code

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(42)

# make a copy of the model, with the same architecture, but randomly initialized weights
model = tf.keras.models.clone_model(model)
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              metrics=["RootMeanSquaredError"])

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_checkpoints", save_weights_only=True)
history = model.fit(X_train, y_train, epochs=30, validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb, early_stopping_cb])


In [None]:
# Plot the model performance
pd.DataFrame(history.history).plot(
    figsize=(8, 5), grid=True, xlabel="Epoch",
    style=["r--", "r--.", "b-", "b-*"])

## Hyperparameter Tuning
We could spend days tweaking things, or we could be more systematic about it (like the `GridSearchCV` in scikit-learn from week 1). For that matter, we could use scikit-learn tools directly, but there's also a Keras Tuner library that's built for this.

❓ **Discussion questions**: 
- What are some hyperparameters we could tune?
- Which ones are most important?
- How do we decide on the range of values to try?
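
A systematic search can be as simple as sampling configurations at random and keeping the best one. The sketch below shows that idea in plain Python; the ranges are illustrative, and `toy_objective` is a made-up stand-in for "build the model, train it, and return the validation loss".

```python
import random

def sample_hyperparameters(rng):
    """Draw one configuration from illustrative integer ranges."""
    return {
        "n_hidden": rng.randint(0, 8),       # randint is inclusive at both ends
        "n_neurons": rng.randint(16, 256),
    }

def toy_objective(config):
    """Stand-in for a validation loss; the formula is invented for this sketch."""
    return abs(config["n_hidden"] - 3) + abs(config["n_neurons"] - 64) / 64

rng = random.Random(42)                      # seeded so the trials are repeatable
trials = [sample_hyperparameters(rng) for _ in range(5)]
best = min(trials, key=toy_objective)        # lower is better, like a loss
print(best, toy_objective(best))
```

In a real search each trial is a full training run, so the number of trials is usually the main cost to budget for.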

Let's use the Keras Tuner to do a quick hyperparameter search on the housing price prediction problem. For this to work, we need to wrap our model creation into a function that takes an `hp` argument (for hyperparameters) and returns a model.

We'll go back to the sequential model for simplicity - it was actually working the best anyway.

In [None]:
import keras_tuner as kt

def build_model(hp):
    # Original model had 3 hidden layers with 50 neurons each
    n_hidden = hp.Int("n_hidden", min_value=0, max_value=8, default=2)
    n_neurons = hp.Int("n_neurons", min_value=16, max_value=256)
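    # adapt the normalization layer on the training data; an un-adapted layer
    # keeps its initial statistics and leaves the features effectively unscaled
    norm_layer = tf.keras.layers.Normalization(input_shape=X_train.shape[1:])
    norm_layer.adapt(X_train)
    model = tf.keras.Sequential()
    model.add(norm_layer)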
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
    model.add(tf.keras.layers.Dense(1))
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
    model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
    return model

In [None]:
random_search_tuner = kt.RandomSearch(
    build_model,
    objective="val_loss",
    max_trials=5,
    overwrite=True,
    directory="random_search",
    project_name="california_housing",
    seed=42)

random_search_tuner.search(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))

In [None]:
random_search_tuner.get_best_models()[0].summary()