# Boston House Prices - Regression Task <a class="anchor" id="boston-home-anchor"></a>
* [Loading the Data](#boston-load-anchor)
* [Manipulating the Data](#boston-manipulate-anchor)
* [Building the Model](#boston-build-anchor)
* [Evaluating the Model](#boston-evaluate-anchor)

In this notebook, I walk through the task of predicting house prices with a regression model.

## Loading the Data <a class="anchor" id="boston-load-anchor"></a>
[home](#boston-home-anchor)

The dataset we use for this task is the well-known *Boston Housing Prices Dataset*, which consists of 404 training samples and 102 test samples. Each sample has 13 numeric features, and each target is the median home value in thousands of dollars.

In [1]:
from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz


## Manipulating the Data <a class="anchor" id="boston-manipulate-anchor"></a>
[home](#boston-home-anchor)

When features have very different scales, we need to normalize the dataset to make learning easier. We **normalize** the dataset by computing the mean and the standard deviation of each feature in the training set, then subtracting the mean and dividing by the standard deviation. Note that the mean and standard deviation come from the training set only, because we don't want any information from the test data to leak into the training process.

In [10]:
import numpy as np

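# Compute per-feature statistics on the training set only.
mean = train_data.mean(axis=0)
std  = train_data.std(axis=0)

# Normalize the training data in place.
train_data -= mean
train_data /= std

# Apply the same training-set statistics to the test data.
test_data -= mean
test_data /= std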
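
As a quick sanity check on the normalization above, the following sketch applies the same steps to small synthetic arrays. The random data and the names `toy_train`, `toy_test`, `train_mean` and `train_std` are illustrative assumptions, not the Boston data. After normalization, each training feature should have a mean close to 0 and a standard deviation close to 1; the test features will only be approximately so, because they reuse the training statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 3 features on very different scales.
scales = np.array([1.0, 100.0, 0.01])
toy_train = rng.normal(loc=5.0, scale=1.0, size=(200, 3)) * scales
toy_test = rng.normal(loc=5.0, scale=1.0, size=(50, 3)) * scales

# Statistics come from the training split only.
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)

toy_train_norm = (toy_train - train_mean) / train_std
toy_test_norm = (toy_test - train_mean) / train_std

print("train mean:", toy_train_norm.mean(axis=0))
print("train std: ", toy_train_norm.std(axis=0))
print("test mean: ", toy_test_norm.mean(axis=0))
```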

## Building the Model <a class="anchor" id="boston-build-anchor"></a>
[home](#boston-home-anchor)

For the task at hand we use a neural network consisting of an input layer, two hidden layers of 64 units each, and a single-unit output layer. We have only 404 training samples of 13 features each. With so little data, a network with many layers and units would tend to *overfit*, so we keep it small. The output layer has no activation function because this is a regression task: an activation would constrain the range of values the output can take.

In [11]:
from keras import models
from keras import layers

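def build_model():
    # Build a fresh, untrained model on every call (one per cross-validation fold).
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    # Single linear output unit: no activation, so predictions are unconstrained.
    model.add(layers.Dense(1))
    # Train on mean squared error; also track mean absolute error.
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model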
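
To put the size of this network in perspective, the number of trainable parameters can be worked out by hand: a dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The helper `dense_param_count` below is a hypothetical name introduced for this sketch; it is not part of Keras.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 13            # features per sample in this dataset
layer_units = [64, 64, 1]  # two hidden layers and the output layer

total_params = 0
n_inputs = n_features
for units in layer_units:
    total_params += dense_param_count(n_inputs, units)
    n_inputs = units  # this layer's output feeds the next layer

print(total_params)  # 896 + 4160 + 65 = 5121
```

That is roughly 5,000 parameters for 404 training samples, which is why adding more layers or units is risky here. Calling `model.summary()` on the built model should report the same figure.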

## Evaluating the Model <a class="anchor" id="boston-evaluate-anchor"></a>
[home](#boston-home-anchor)

To tune hyperparameters such as the number of epochs and the batch size, we would normally split the training set into a training part and a validation part. With only 404 training samples, however, the validation set would be very small, so the validation score would depend heavily on which samples happen to land in it. We therefore use **K-fold cross-validation**: split the data into K partitions, train K identical models, each on K-1 partitions while validating on the remaining one, and average the K validation scores. The code below implements this method.

In [13]:
k               = 4
num_val_samples = len(train_data) // k
num_epochs      = 100
all_scores      = []

for i in range(k):
    print('Processing fold #', i)
    # Validation data: the i-th partition.
    val_data    = train_data[i * num_val_samples: (i+1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i+1) * num_val_samples]

    # Training data: everything before and after the i-th partition.
    partial_train_data    = np.concatenate([train_data[:i * num_val_samples], train_data[(i+1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate([train_targets[:i * num_val_samples], train_targets[(i+1) * num_val_samples:]], axis=0)

    # Train a fresh model on this fold (silent mode).
    model = build_model()
    model.fit(partial_train_data, partial_train_targets, epochs=num_epochs, batch_size=1, verbose=0)

    # Evaluate on the held-out partition and keep the validation MAE.
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

Processing fold # 0
Processing fold # 1
Processing fold # 2
Processing fold # 3
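
The slicing used in the loop above can be checked in isolation, without training anything. The following sketch works on plain indices only; the names `n_samples`, `n_folds`, `fold_size` and `folds` are introduced here for illustration. It shows which index range each fold validates on and confirms that every sample is validated exactly once.

```python
n_samples = 404
n_folds = 4
fold_size = n_samples // n_folds  # 101 samples per validation fold

folds = []
for fold in range(n_folds):
    val_start = fold * fold_size
    val_stop = (fold + 1) * fold_size
    val_idx = list(range(val_start, val_stop))
    # Training indices: everything before and after the validation slice.
    train_idx = list(range(0, val_start)) + list(range(val_stop, n_samples))
    folds.append((train_idx, val_idx))
    print(f"fold {fold}: validate on [{val_start}, {val_stop}), "
          f"train on {len(train_idx)} samples")

# Every sample appears in exactly one validation fold.
all_val = sorted(idx for _, val_idx in folds for idx in val_idx)
print(all_val == list(range(n_samples)))
```

With 404 samples and 4 folds the partitions divide evenly. When they do not, the integer division leaves the last few samples out of every validation fold, although they are still used for training.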


In [17]:
np.mean(all_scores)

2.37664994597435