### House price predictions with neural networks

A small example that trains a simple multi-layer neural network in Keras to predict house prices. The data is available on [Kaggle](https://www.kaggle.com/lodhaad/house-prices).

In [1]:
import pandas as pd
import numpy as np

In [2]:
# First let's read our data into memory and view the top rows using the pandas head() function

data = pd.read_csv('home_data.csv')

data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Understanding your data is one of the most important steps before tackling a data science problem. One of the easiest ways to look for initial relationships is to compute a correlation matrix, which helps us judge which columns are important and which are expendable. It is important to remember, however, that a field with a low correlation in its current form may still be useful after some further preprocessing, because the correlation coefficient only measures linear relationships.
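To illustrate that last point, here is a minimal sketch on hypothetical toy data (not the house price dataset). The target `y` depends on `x` only through `x ** 2`, so the Pearson correlation with the raw column is zero, while the squared column correlates perfectly. The `pearson` helper is written out purely for illustration; `data.corr()` computes the same coefficient by default.

```python
import numpy as np

# Hypothetical toy data: y depends on x only through x squared
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)


raw_corr = pearson(x, y)                # the raw column looks useless
transformed_corr = pearson(x ** 2, y)   # the preprocessed column does not

print("raw:", raw_corr)
print("after squaring:", transformed_corr)
```

A column that looks expendable in the correlation matrix may therefore deserve a second look once it has been transformed.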

In [3]:
data.corr()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
id,1.0,-0.016762,0.001286,0.00516,-0.012258,-0.132109,0.018525,-0.002721,0.011592,-0.023783,0.00813,-0.010842,-0.005151,0.02138,-0.016907,-0.008224,-0.001891,0.020799,-0.002901,-0.138798
price,-0.016762,1.0,0.30835,0.525138,0.702035,0.089661,0.256794,0.266369,0.397293,0.036362,0.667434,0.605567,0.323816,0.054012,0.126434,-0.053203,0.307003,0.021626,0.585379,0.082447
bedrooms,0.001286,0.30835,1.0,0.515884,0.576671,0.031703,0.175429,-0.006582,0.079532,0.028472,0.356967,0.4776,0.303093,0.154178,0.018841,-0.152668,-0.008931,0.129473,0.391638,0.029244
bathrooms,0.00516,0.525138,0.515884,1.0,0.754665,0.08774,0.500653,0.063744,0.187737,-0.124982,0.664983,0.685342,0.28377,0.506019,0.050739,-0.203866,0.024573,0.223042,0.568634,0.087175
sqft_living,-0.012258,0.702035,0.576671,0.754665,1.0,0.172826,0.353949,0.103818,0.284611,-0.058753,0.762704,0.876597,0.435043,0.318049,0.055363,-0.19943,0.052529,0.240223,0.75642,0.183286
sqft_lot,-0.132109,0.089661,0.031703,0.08774,0.172826,1.0,-0.005201,0.021604,0.07471,-0.008958,0.113621,0.183512,0.015286,0.05308,0.007644,-0.129574,-0.085683,0.229521,0.144608,0.718557
floors,0.018525,0.256794,0.175429,0.500653,0.353949,-0.005201,1.0,0.023698,0.029444,-0.263768,0.458183,0.523885,-0.245705,0.489319,0.006338,-0.059121,0.049614,0.125419,0.279885,-0.011269
waterfront,-0.002721,0.266369,-0.006582,0.063744,0.103818,0.021604,0.023698,1.0,0.401857,0.016653,0.082775,0.072075,0.080588,-0.026161,0.092885,0.030285,-0.014274,-0.04191,0.086463,0.030703
view,0.011592,0.397293,0.079532,0.187737,0.284611,0.07471,0.029444,0.401857,1.0,0.04599,0.251321,0.167649,0.276947,-0.05344,0.103917,0.084827,0.006157,-0.0784,0.280439,0.072575
condition,-0.023783,0.036362,0.028472,-0.124982,-0.058753,-0.008958,-0.263768,0.016653,0.04599,1.0,-0.144674,-0.158214,0.174105,-0.361417,-0.060618,0.003026,-0.014941,-0.1065,-0.092824,-0.003406


In [4]:
corr_mat = data.corr()
# Filter by price column and sort descending
corr_mat['price'].sort_values(ascending=False)

price            1.000000
sqft_living      0.702035
grade            0.667434
sqft_above       0.605567
sqft_living15    0.585379
bathrooms        0.525138
view             0.397293
sqft_basement    0.323816
bedrooms         0.308350
lat              0.307003
waterfront       0.266369
floors           0.256794
yr_renovated     0.126434
sqft_lot         0.089661
sqft_lot15       0.082447
yr_built         0.054012
condition        0.036362
long             0.021626
id              -0.016762
zipcode         -0.053203
Name: price, dtype: float64

### Data cleaning and preprocessing

Before we can train a neural network we need to prepare our data. First we split the labels from the main dataset and remove any unwanted fields that may confuse the model.

In [5]:
labels = data[['price']]
features = data.drop(['id', 'date', 'price', 'zipcode', 'yr_built', 'condition','yr_renovated', 'lat', 'long', 'sqft_lot15'], axis=1)

print(features.shape, labels.shape)

(21613, 11) (21613, 1)


Scikit-learn, one of the largest Python machine learning libraries, and Keras are both designed to work with pandas DataFrames and the NumPy arrays behind them. Functions from both libraries can therefore be used together.

Here we use the scikit-learn preprocessing module to scale our input data. This is important because large numbers can be problematic for neural networks. To account for this we use a StandardScaler, which standardises features by removing the mean and scaling to unit variance.
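As a rough sketch of what standardisation does, the same arithmetic can be written out with NumPy. The numbers are hypothetical and the column meanings are only for illustration. The notebook scales the full dataset before splitting; a common variation, used in this sketch, is to learn the mean and standard deviation from the training rows only and reuse them for the test rows, so that no information from the test set reaches the model.

```python
import numpy as np

# Hypothetical toy feature matrix: columns stand in for sqft_living and bedrooms
train = np.array([[1000.0, 2.0],
                  [2000.0, 3.0],
                  [3000.0, 4.0]])
test = np.array([[2500.0, 3.0]])

# Learn the statistics from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same transformation to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately zero per column
print(train_scaled.std(axis=0))   # approximately one per column
print(test_scaled)
```

After scaling, every training column has zero mean and unit variance, regardless of whether it started out in square feet or in room counts.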

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaled_features = StandardScaler().fit_transform(features.values)

X_train, X_test, y_train, y_test = train_test_split(scaled_features, labels.values, test_size=0.1, random_state=42)

print('Train Size', X_train.shape, y_train.shape)
print('Test Size', X_test.shape, y_test.shape)

Train Size (19451, 11) (19451, 1)
Test Size (2162, 11) (2162, 1)
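The split sizes can be checked by hand. This sketch assumes the test count is rounded up and the remaining rows go to training, which matches the shapes printed above.

```python
import math

n_samples = 21613   # rows in the dataset, from the shapes printed earlier
test_size = 0.1     # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```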


### Creating our model

Now we need to define our model architecture and hyperparameters. The options here define not only the shape of your network but how it learns. This is where we can easily experiment with all the complex underlying mathematical principles behind neural networks.

At the core of Keras is the **Sequential** model. Put simply, a sequential model is a stack of layers in which the output of one layer becomes the input of the next. The most important building block here is the **Dense** layer, which multiplies its inputs by a weight matrix, adds a bias and then applies the activation function.
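The computation inside a dense layer can be sketched in a few lines of NumPy. This is an illustration with hypothetical weights, not Keras's own implementation; the `dense` helper and its arguments are names chosen for this example.

```python
import numpy as np


def dense(x, weights, bias, activation=None):
    """Compute activation(x @ weights + bias) for a batch of inputs."""
    z = x @ weights + bias
    if activation == "relu":
        return np.maximum(z, 0.0)  # ReLU keeps positive values, zeroes the rest
    return z


# Hypothetical numbers: a batch of 2 samples with 3 features, mapped to 2 units
x = np.array([[1.0, 2.0, 3.0],
              [0.0, -1.0, 1.0]])
weights = np.array([[0.5, -1.0],
                    [0.0, 1.0],
                    [1.0, 0.0]])
bias = np.array([0.5, -2.0])

out = dense(x, weights, bias, activation="relu")
print(out)
```

Stacking several such calls, each feeding its output to the next, is what the Sequential model does during a forward pass.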

In [7]:
from keras.models import Sequential
from keras import optimizers
from keras.layers import Dense

model = Sequential()
model.add(Dense(8, input_dim=X_train.shape[1], kernel_initializer="normal", activation='relu'))
model.add(Dense(4, kernel_initializer="normal", activation='relu'))
model.add(Dense(4, kernel_initializer="normal", activation='relu'))
model.add(Dense(8, kernel_initializer="normal", activation='relu'))
model.add(Dense(1))

model.summary()

Using TensorFlow backend.


Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 8)                 96        
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 36        
_________________________________________________________________
dense_3 (Dense)              (None, 4)                 20        
_________________________________________________________________
dense_4 (Dense)              (None, 8)                 40        
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 9         
Total params: 201
Trainable params: 201
Non-trainable params: 0
_________________________________________________________________
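The parameter counts in the summary can be reproduced by hand: each dense layer has one weight per input-output pair plus one bias per output unit. A short check, using the 11 input features and the layer widths defined above:

```python
# Input width followed by the width of each Dense layer
layer_sizes = [11, 8, 4, 4, 8, 1]

# weights (n_in * n_out) plus biases (n_out) for each consecutive pair of layers
params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(params)

print(params, total)
```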


Next we set our model hyper-parameters. The key parameters to decide here are the [loss function](https://keras.io/losses/), [optimiser](https://keras.io/optimizers/), [epoch and batch](https://keras.io/getting-started/faq/#what-does-sample-batch-epoch-mean). Understanding each of these and experimenting with different combinations is the key to a successful model.

In [None]:
# Set learning rate
lr = 0.3

# Set optimiser
opt = optimizers.Adam(lr=lr)

# Compile model
model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mae'])

# Set to variable if you want to store training statistics
history = model.fit(X_train, y_train, epochs=20, batch_size=32)

Instructions for updating:
Use tf.cast instead.
Epoch 1/20
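To make the epoch, batch and loss settings more concrete, here is a small sketch in plain Python. The training size, batch size and epoch count come from the cells above; the prices and predictions are hypothetical, and the two loss helpers are written out only to show the arithmetic behind `mean_squared_error` and `mae`.

```python
import math

n_train = 19451   # training rows, from the split above
batch_size = 32
epochs = 20

# One epoch is a full pass over the training data; each batch triggers one weight update
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs


def mean_squared_error(y_true, y_pred):
    """Average of squared differences: penalises large misses heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences: stays in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices and predictions in dollars
y_true = [200000.0, 500000.0]
y_pred = [210000.0, 470000.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(steps_per_epoch, total_updates)
print(mse, mae)
```

The squared loss is dominated by the larger miss, which is why the mean absolute error is often easier to read as a metric when the target is a price.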


We can import a plotting library to visualise statistics from our model training. This can be very useful for determining whether a model is still improving, has already converged or is over-fitting.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

print(history.history.keys())

plt.figure()
plt.plot(history.history['loss'])
plt.show()
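Over-fitting only becomes visible when a validation curve is tracked alongside the training loss. Keras records a `val_loss` entry in `history.history` when `validation_split` or `validation_data` is passed to `fit`. The sketch below uses made-up loss values with that shape; `best_epoch` is a hypothetical helper, not part of Keras.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Hypothetical values, shaped like history.history when validation data is used
example_history = {
    "loss":     [9.0, 6.0, 4.0, 3.0, 2.5, 2.2],
    "val_loss": [9.5, 6.5, 5.0, 4.8, 5.1, 5.6],
}

epoch = best_epoch(example_history["val_loss"])

# If the best epoch is the last one, the model may still be improving
still_improving = epoch == len(example_history["val_loss"])

print(epoch, still_improving)
```

In this made-up run the training loss keeps falling while the validation loss rises after the fourth epoch, which is the typical signature of over-fitting.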

### Evaluating model performance

Once you have trained your model, its performance needs to be evaluated. The easiest way to do this is to run the model on the entire test dataset that we set aside earlier and compare each prediction with the actual value. Here scikit-learn's `mean_absolute_error` does the comparison for us and averages the absolute differences.

**Note:** Only the input features were scaled here, so the predictions are already in dollars. If you also scale the labels, remember to apply the inverse of the scaler to bring the predictions back to their original scale.

In [None]:
from sklearn.metrics import mean_absolute_error

predictions = model.predict(X_test)

mae = mean_absolute_error(y_test, predictions)

print("Mean absolute error: $%.2f" % mae)
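The comparison and the inverse scaling mentioned in the note can be sketched by hand. This example assumes the labels were standardised before training, which the notebook itself does not do, and all numbers are hypothetical. With a fitted scikit-learn scaler, `inverse_transform` performs the same inversion.

```python
import numpy as np

# Hypothetical prices in dollars
y_true = np.array([200000.0, 500000.0, 800000.0])

# Assumption: the labels were standardised with these statistics before training
mean, std = y_true.mean(), y_true.std()

# Hypothetical model outputs in the scaled space
scaled_predictions = np.array([-1.0, 0.1, 1.2])

# Invert the scaling: multiply by the standard deviation, then add the mean
predictions = scaled_predictions * std + mean

# Compare each prediction with the actual value
errors = []
for actual, predicted in zip(y_true, predictions):
    errors.append(abs(actual - predicted))

mae = sum(errors) / len(errors)
print("Mean absolute error: $%.2f" % mae)
```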

### Linear regression

Both Keras and scikit-learn are designed to take NumPy arrays and pandas DataFrames as inputs. We can therefore pass the same training data into a range of scikit-learn regression models, such as [Linear](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) or [Support Vector Machine](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) regression.

In [None]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(X_train, y_train)

In [None]:
l_predictions = regr.predict(X_test)

l_mae = mean_absolute_error(y_test, l_predictions)

print("Mean absolute error: $%.2f" % l_mae)
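For intuition about what the linear baseline fits, ordinary least squares can be sketched directly with NumPy. The data here is hypothetical and noise-free, so the true coefficients are recovered exactly; this is an illustration of the technique rather than scikit-learn's internal code.

```python
import numpy as np

# Hypothetical noise-free data: price = 3 * x0 - 2 * x1 + 5
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 5.0]])
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

# Append a column of ones so the intercept is learned as an extra weight
X_design = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve the least-squares problem min ||X_design @ w - y||^2
solution, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coefficients, intercept = solution[:2], solution[2]

predictions = X_design @ solution

print(coefficients, intercept)
```

Comparing the neural network's error against a simple baseline like this shows whether the extra model complexity is actually paying off.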