# California housing dataset regression with MLPs

In this notebook, we'll train a multi-layer perceptron model to estimate median house values for California housing districts.

First, the needed imports. Keras tells us which backend (Theano, TensorFlow, CNTK) it will be using.

In [None]:
%matplotlib inline

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import np_utils
from keras import backend as K

from distutils.version import LooseVersion as LV
from keras import __version__

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Using Keras version:', __version__, 'backend:', K.backend())
assert(LV(__version__) >= LV("2.0.0"))

## Data

Then we load the California housing data. The first time this cell is run, the data has to be downloaded, which can take a while.

In [None]:
chd = datasets.fetch_california_housing()

The data consists of 20640 housing districts, each described by 8 attributes: *MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude*. There is also a target value (median house value) for each housing district.
 
Let's plot all attributes against the target value:

In [None]:
plt.figure(figsize=(15,10))
for i in range(8):
    plt.subplot(4,2,i+1)
    plt.scatter(chd.data[:,i], chd.target, s=2, label=chd.feature_names[i])
    plt.legend(loc='best')

We'll now split the data into a training and a test set: 

In [None]:
test_size = 5000

X_train_all, X_test_all, y_train, y_test = train_test_split(
    chd.data, chd.target, test_size=test_size, shuffle=True)

X_train_single = X_train_all[:,0].reshape(-1, 1)
X_test_single = X_test_all[:,0].reshape(-1, 1)
     
print()
print('California housing data: train:',len(X_train_all),'test:',len(X_test_all))
print()
print('X_train_all:', X_train_all.shape)
print('X_train_single:', X_train_single.shape)
print('y_train:', y_train.shape)
print()
print('X_test_all', X_test_all.shape)
print('X_test_single', X_test_single.shape)
print('y_test', y_test.shape)

The training data matrix `X_train_all` is a matrix of size (`n_train`, 8), and `X_train_single` contains only the first attribute *(MedInc)*. `y_train` is a vector containing the target value (median house value) for each housing district in the training set.
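
The `reshape(-1, 1)` call used for the single-attribute matrices can be illustrated on a small made-up array. This is only a sketch with toy values (the array `X` below is not the housing data): selecting one column yields a 1-D array, while scikit-learn and Keras expect a 2-D matrix of shape (`n_samples`, `n_features`).

```python
import numpy as np

# Toy stand-in for a data matrix: 4 districts, 3 attributes (made-up values).
X = np.arange(12, dtype=float).reshape(4, 3)

# Selecting a single column drops the second axis.
first_column = X[:, 0]                        # shape (4,)

# reshape(-1, 1) restores it as a column: -1 means "infer this size".
single_feature = first_column.reshape(-1, 1)  # shape (4, 1)

print('first_column:', first_column.shape)
print('single_feature:', single_feature.shape)
```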

Let's start our analysis with a single attribute *(MedInc)*:

In [None]:
X_train = X_train_single
X_test = X_test_single

#X_train = X_train_all
#X_test = X_test_all

As the final step, let's scale the input data to zero mean and unit variance: 

In [None]:
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
print('X_train: mean:', X_train.mean(axis=0), 'std:', X_train.std(axis=0))
print('X_test: mean:', X_test.mean(axis=0), 'std:', X_test.std(axis=0))
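
To make the scaling step concrete, here is a minimal NumPy sketch of the same computation on made-up data (the arrays `train` and `test` are hypothetical stand-ins, not the housing data). The mean and standard deviation are estimated on the training set only and then reused for the test set, which is why the test statistics end up close to, but not exactly, 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for the real training and test matrices.
train = rng.normal(loc=4.0, scale=2.0, size=(1000, 1))
test = rng.normal(loc=4.0, scale=2.0, size=(200, 1))

# Statistics come from the training data only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same shift and scale are applied to both sets.
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print('train: mean:', train_scaled.mean(axis=0), 'std:', train_scaled.std(axis=0))
print('test: mean:', test_scaled.mean(axis=0), 'std:', test_scaled.std(axis=0))
```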

## One hidden layer

### Initialization

Let's begin with a simple model that has a single hidden layer. We first initialize the model with `Sequential()`. Then we add a `Dense` layer that has `X_train.shape[1]` inputs (one for each attribute in the training data) and 10 units with ReLU activation. A `Dense` layer connects every input to every unit through its own weight parameter, and each unit also has a bias term. The output layer is a second `Dense` layer with a single unit and a linear activation function.

Finally, we select *mean squared error* as the loss function, select [*stochastic gradient descent*](https://keras.io/optimizers/#sgd) as the optimizer, and `compile()` the model. Note there are [several different options](https://keras.io/optimizers/) for the optimizer in Keras that we could use instead of *sgd*.

In [None]:
linmodel = Sequential()
linmodel.add(Dense(units=10, input_dim=X_train.shape[1], activation='relu'))
linmodel.add(Dense(units=1, activation='linear'))

linmodel.compile(loss='mean_squared_error', 
                 optimizer='sgd')
print(linmodel.summary())
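
As a rough illustration of what these two `Dense` layers compute, the following NumPy sketch implements the forward pass and the parameter count by hand. The helper names `count_params` and `forward` and the random weights are assumptions made for this example; Keras initializes and stores its weights internally. The counts should agree with the totals reported by `summary()`: 31 parameters with one input attribute, 101 with all eight.

```python
import numpy as np

def count_params(n_inputs, n_hidden=10, n_outputs=1):
    """Weights plus biases of a one-hidden-layer MLP."""
    hidden = n_inputs * n_hidden + n_hidden      # weight matrix + biases
    output = n_hidden * n_outputs + n_outputs
    return hidden + output

def forward(x, W1, b1, W2, b2):
    """Hidden layer with ReLU, followed by a linear output layer."""
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

rng = np.random.default_rng(0)
n_inputs = 1

# Hypothetical random weights, only for checking shapes.
W1 = rng.normal(size=(n_inputs, 10))
b1 = np.zeros(10)
W2 = rng.normal(size=(10, 1))
b2 = np.zeros(1)

x = rng.normal(size=(5, n_inputs))   # 5 made-up (scaled) inputs
y_hat = forward(x, W1, b1, W2, b2)

print('output shape:', y_hat.shape)
print('parameters:', count_params(1), '(1 input),', count_params(8), '(8 inputs)')
```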

We can also draw a fancier graph of our model.

In [None]:
SVG(model_to_dot(linmodel, show_shapes=True).create(prog='dot', format='svg'))

### Learning

Now we are ready to train our first model.  An *epoch* means one pass through the whole training data. 

You can run the code below multiple times and it will continue the training process from where it left off. If you want to start from scratch, re-initialize the model by re-running the initialization cell above.

In [None]:
%%time
epochs = 10 

linhistory = linmodel.fit(X_train, 
                          y_train, 
                          epochs=epochs, 
                          batch_size=32,
                          verbose=2)
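
To show what happens inside one epoch, here is a simplified sketch of mini-batch stochastic gradient descent on the mean squared error. It is not what Keras runs internally: it fits a plain linear model `w * x + b` to made-up data, with hand-derived gradients and an assumed learning rate of 0.01. The batching logic is the same idea, though: with 15640 training rows and `batch_size=32`, one epoch would consist of `ceil(15640 / 32) = 489` weight updates.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, batch_size, epochs, lr = 1000, 32, 10, 0.01

# Made-up data: y = 2x + 1 plus a little noise.
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=n)

w, b = 0.0, 0.0
steps_per_epoch = math.ceil(n / batch_size)
losses = []

for epoch in range(epochs):
    order = rng.permutation(n)               # reshuffle every epoch
    for step in range(steps_per_epoch):
        idx = order[step * batch_size:(step + 1) * batch_size]
        err = w * x[idx] + b - y[idx]        # prediction minus target
        # Gradients of mean(err**2) with respect to w and b.
        w -= lr * 2.0 * np.mean(err * x[idx])
        b -= lr * 2.0 * np.mean(err)
    losses.append(float(np.mean((w * x + b - y) ** 2)))

print('updates per epoch:', steps_per_epoch)
print('loss after first epoch: %.3f, after last epoch: %.3f' % (losses[0], losses[-1]))
```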

Let's now see how the training progressed. *Loss* is a function of the difference between the network output and the target values. We are minimizing the loss function during training, so it should decrease over time.

In [None]:
plt.figure(figsize=(5,3))
plt.plot(linhistory.epoch,linhistory.history['loss'])
plt.title('loss');

In [None]:
if X_train.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_train, y_train, s=5)
    reg_x = np.arange(np.min(X_train), np.max(X_train), 0.01).reshape(-1, 1)
    plt.scatter(reg_x, linmodel.predict(reg_x), s=8, label='one hidden layer')
    plt.legend(loc='best');

### Inference

For a better measure of the quality of the model, let's compute the mean squared error on the test data, which the model has not seen during training.

In [None]:
%%time

predictions = linmodel.predict(X_test)
print("Mean squared error: %.3f"
      % mean_squared_error(y_test, predictions))
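
A raw mean squared error is easier to judge next to a reference point. The sketch below uses made-up numbers (not results from this notebook) to compute the MSE by hand, its square root (RMSE, which is in the same units as the target, here units of $100,000), and the MSE of a constant prediction equal to the mean target. In the notebook, that constant would be `y_train.mean()`. The helper `mse` is written for this example. It flattens both arrays first, because subtracting an `(n, 1)` prediction from an `(n,)` target in plain NumPy would silently broadcast to an `(n, n)` matrix.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error after flattening both inputs to 1-D."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up targets and predictions; predictions have shape (n, 1),
# like the output of a model with a single output unit.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([[1.5], [2.0], [2.5], [5.0]])

model_mse = mse(y_true, y_pred)
model_rmse = model_mse ** 0.5
baseline_mse = mse(y_true, np.full_like(y_true, y_true.mean()))

print('model MSE: %.3f, RMSE: %.3f' % (model_mse, model_rmse))
print('constant-mean baseline MSE: %.3f' % baseline_mse)
```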

## Multiple hidden layers

### Initialization

Let's now create a more complex MLP model that has multiple dense layers and a dropout layer. `Dropout()` randomly sets a fraction of its inputs to zero during training, which is one approach to regularization and can sometimes help to prevent overfitting.

The last layer needs to have a single unit with linear activation to match the ground truth (`y_train`).

Finally, we again `compile()` the model, this time using [*Adam*](https://keras.io/optimizers/#adam) as the optimizer.

In [None]:
mlmodel = Sequential()

mlmodel.add(Dense(units=20, input_dim=X_train.shape[1], activation='relu'))
mlmodel.add(Dense(units=20, activation='relu'))
mlmodel.add(Dropout(0.5))

mlmodel.add(Dense(units=1, activation='linear'))

mlmodel.compile(loss='mean_squared_error', 
                optimizer='adam')
print(mlmodel.summary())
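
The effect of `Dropout(0.5)` can be sketched in a few lines of NumPy. The function `dropout` below is a hand-written illustration, not the Keras implementation: during training it zeroes each input with probability `rate` and scales the surviving values by `1 / (1 - rate)` so that the expected activation is unchanged; at inference time it passes its input through untouched.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction of inputs, rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate    # True where the input survives
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 20))         # made-up hidden-layer outputs

dropped = dropout(activations, 0.5, rng)
frac_zero = float(np.mean(dropped == 0.0))

print('fraction zeroed: %.3f' % frac_zero)
print('mean activation after dropout: %.3f' % dropped.mean())
```

Because dropout is only active during training, the loss printed by `fit()` is computed on a randomly thinned network and is not directly comparable to an error computed with `predict()`.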

In [None]:
SVG(model_to_dot(mlmodel, show_shapes=True).create(prog='dot', format='svg'))

### Learning

In [None]:
%%time
epochs = 10 

history = mlmodel.fit(X_train, 
                      y_train, 
                      epochs=epochs, 
                      batch_size=32,
                      verbose=2)

In [None]:
plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history['loss'])
plt.title('loss');

In [None]:
if X_train.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_train, y_train, s=5)
    reg_x = np.arange(np.min(X_train), np.max(X_train), 0.01).reshape(-1, 1)
    plt.scatter(reg_x, linmodel.predict(reg_x), s=8, label='one hidden layer')
    plt.scatter(reg_x, mlmodel.predict(reg_x), s=8, label='multiple hidden layers')
    plt.legend(loc='best');

### Inference

In [None]:
%%time

predictions = mlmodel.predict(X_test)
print("Mean squared error: %.3f"
      % mean_squared_error(y_test, predictions))

## Model tuning

Try to reduce the mean squared error of the regression.
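
One possible way to compare settings without repeatedly looking at the test set is to hold out part of the training data for validation. The sketch below is a suggestion rather than part of the original exercise; the helper `holdout_indices` and the 20% fraction are assumptions chosen for illustration.

```python
import numpy as np

def holdout_indices(n_samples, val_fraction=0.2, seed=0):
    """Return shuffled, disjoint (train, validation) index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    return order[n_val:], order[:n_val]

# 15640 is the number of training rows left after the 5000-row test split.
train_idx, val_idx = holdout_indices(15640)

print('train rows:', len(train_idx), 'validation rows:', len(val_idx))
```

The index arrays could then be used as `X_train[train_idx]`, `y_train[train_idx]` and so on. Keras's `fit()` also accepts a `validation_split` argument that holds out a fraction of the training data in a similar spirit.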