## Boston House Price Prediction

The dataset describes 13 numerical properties of houses in Boston suburbs, and the task is to model the price of houses in those suburbs in thousands of dollars. This makes it a regression predictive modeling problem. Input attributes include crime rate, the proportion of non-retail business acres, chemical concentrations and more.

Reasonable performance for models evaluated using Mean Squared Error (MSE) is around 20 in squared thousands of dollars (or about $4,500 if you take the square root). This is a good target to aim for with our neural network model.
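To make the unit conversion concrete, here is a minimal sketch that turns an MSE expressed in squared thousands of dollars into an RMSE in dollars. The helper name `mse_to_rmse_dollars` is an illustrative assumption and is not part of the notebook.

```python
import math

def mse_to_rmse_dollars(mse_thousands_sq):
    """Convert an MSE in squared thousands of dollars to an RMSE in dollars."""
    rmse_thousands = math.sqrt(mse_thousands_sq)  # back to thousands of dollars
    return rmse_thousands * 1000.0                # thousands of dollars -> dollars

rmse = mse_to_rmse_dollars(20.0)
print("RMSE: about $%.0f" % rmse)
```

The same conversion can be applied to any MSE reported in the cells below to read it as a typical error in dollars.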

In [14]:
#develop a baseline neural net model
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

#load dataset
dataframe=pandas.read_csv('housing.data',delim_whitespace=True,header=None)
dataset=dataframe.values
#split into input (X) and output (Y) variables
X=dataset[:,0:13]
Y=dataset[:,13]

#define a base model
def baseline_model():
    #create model
    model=Sequential()
    model.add(Dense(13,input_dim=13,init='normal',activation='relu'))
    model.add(Dense(1,init='normal'))
    #compile model
    model.compile(loss='mean_squared_error',optimizer='adam')
    return model


#fix random seed for reproducibility
seed=43
numpy.random.seed(seed)
#evaluate the baseline model on the raw (unstandardized) dataset
estimator=KerasRegressor(build_fn=baseline_model,nb_epoch=100,batch_size=5,verbose=0)


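#10-fold cross validation; the legacy sklearn.cross_validation KFold takes the sample count n first
kfold=KFold(n=len(X),n_folds=10,random_state=seed)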
results=cross_val_score(estimator,X,Y,cv=kfold)
print("Results: %.2f  (%.2f) MSE" % (results.mean(),results.std()))



Results: 80.14  (35.58) MSE


### Modeling the Standardized Dataset
An important concern with the Boston house price dataset is that the input attributes all vary in their scales because they measure different quantities.
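As a rough illustration of what standardization does to attributes on different scales, the sketch below rescales two columns to zero mean and unit standard deviation. The values and the names `crime_rate`, `tax_rate`, and `standardize_column` are hypothetical and are not taken from the dataset.

```python
import statistics

def standardize_column(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical values for two attributes measured on very different scales.
crime_rate = [0.02, 0.5, 3.0, 9.0]
tax_rate = [200.0, 300.0, 400.0, 700.0]

scaled_columns = {}
for name, column in [("crime_rate", crime_rate), ("tax_rate", tax_rate)]:
    scaled_columns[name] = standardize_column(column)
    print(name, [round(v, 2) for v in scaled_columns[name]])
```

After rescaling, both columns live on a comparable scale, which is the property the network's weight updates benefit from.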

It is almost always good practice to prepare your data before modeling it using a neural network model.

Continuing on from the above baseline model, we can re-evaluate the same network architecture (here trained for 50 epochs rather than 100) using a standardized version of the input dataset.

We can use scikit-learn’s Pipeline framework to perform the standardization during the model evaluation process, within each fold of the cross validation. This ensures that no information leaks from the test fold into the training data in any cross-validation split.
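The idea behind leakage-free standardization can be sketched without any library: in every fold, the mean and standard deviation are computed from the training rows only and then applied to the held-out rows. This is a simplified stand-in for what a pipeline does inside cross validation, not the scikit-learn implementation; the data and the helper names `fit_scaler`, `transform`, and `k_fold_indices` are hypothetical.

```python
import statistics

def fit_scaler(train_values):
    """Compute scaling statistics from the training rows only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)

def transform(values, mean, std):
    """Apply previously fitted statistics to any set of rows."""
    return [(v - mean) / std for v in values]

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]  # hypothetical single feature
fold_stats = []
for train_idx, test_idx in k_fold_indices(len(values), 3):
    train = [values[i] for i in train_idx]
    test = [values[i] for i in test_idx]
    mean, std = fit_scaler(train)              # the test rows are never seen here
    test_scaled = transform(test, mean, std)   # reuse the training statistics
    fold_stats.append((mean, std))
    print("train mean=%.2f, scaled test=%s" % (mean, [round(v, 2) for v in test_scaled]))
```

Note how the fold that holds out the value 100.0 gets a very different training mean from the others. Scaling the full dataset once before splitting would hide that difference and let test-fold information influence training.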

In [15]:
#evaluate model with standardized dataset
estimators=[]
estimators.append(('standardize',StandardScaler()))
estimators.append(('mlp',KerasRegressor(build_fn=baseline_model,nb_epoch=50,batch_size=5,verbose=0)))
pipeline=Pipeline(estimators)
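#10-fold cross validation; the legacy sklearn.cross_validation KFold takes the sample count n first
kfold=KFold(n=len(X),n_folds=10,random_state=seed)
results=cross_val_score(pipeline,X,Y,cv=kfold)
print("Standardized: %.2f  (%.2f) MSE" % (results.mean(),results.std()))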


Standardized: 691.05  (187.29) MSE


### Evaluate a Deeper Network Topology


In [21]:
def larger_model():
    # create model: 13 inputs -> 13 hidden units -> 6 hidden units -> 1 output
    model = Sequential()
    model.add(Dense(13, input_dim=13, init='normal', activation='relu'))
    model.add(Dense(6, init='normal', activation='relu'))
    model.add(Dense(1, init='normal'))
    # compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

numpy.random.seed(seed)
estimators = []
#estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp1', KerasRegressor(build_fn=larger_model, nb_epoch=50, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
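# 10-fold cross validation; the legacy sklearn.cross_validation KFold takes the sample count n first
kfold = KFold(n=len(X), n_folds=10, random_state=seed)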
results = cross_val_score(pipeline, X, Y, cv=kfold)
print("Larger: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Larger: 52.78 (22.53) MSE
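To get a feel for how much capacity the extra hidden layer adds, the sketch below counts trainable parameters for the two topologies, assuming every fully connected layer has one weight per input-output pair plus one bias per unit. The helper name `dense_param_count` is an illustrative assumption.

```python
def dense_param_count(layer_sizes):
    """Count weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

baseline_params = dense_param_count([13, 13, 1])     # 13 inputs -> 13 -> 1
larger_params = dense_param_count([13, 13, 6, 1])    # 13 inputs -> 13 -> 6 -> 1
print("baseline: %d parameters" % baseline_params)
print("larger:   %d parameters" % larger_params)
```

Under these assumptions the baseline network has 196 parameters and the deeper one has 273.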
