## Predicting house prices: Boston data

* It is a regression problem.

#### Boston House Pricing Dataset
The task is to predict the median price of homes in a given Boston suburb in the mid-1970s, given data points about the suburb at the time, such as the crime rate and the local property tax rate. The dataset has relatively few data points: only 506, split between 404 training samples and 102 test samples. Each feature in the input data (for example, the crime rate) has a different scale: some values are proportions between 0 and 1, others take values between 1 and 12, others between 0 and 100, and so on.

### 1. Loading the dataset

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from keras import models
from keras import layers
from keras.datasets import boston_housing

Using TensorFlow backend.


In [3]:
#load the dataset, which comes already split into training and test sets
(train_data,train_targets),(test_data,test_targets)=boston_housing.load_data()

In [4]:
train_data.shape

(404, 13)

* The training dataset consists of 404 observations of 13 different features.

In [5]:
test_data.shape

(102, 13)

* The test dataset consists of 102 observations of 13 different features.

In [6]:
print(train_targets.min(),'-----',train_targets.max())

5.0 ----- 50.0


* The targets are the median values of owner-occupied homes, in thousands of dollars, so the training prices range from \$5,000 to \$50,000.

### 2. Data preparation
It would be problematic to feed into a neural network values that all take wildly different ranges. The network might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), you subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation.

In [7]:
mean=train_data.mean(axis=0)
std=train_data.std(axis=0)

#normalize train data
train_data-=mean
train_data /=std

#normalize test data
test_data-=mean
test_data/=std
#note: the test data is normalized with the mean and std computed on the training data;
#never compute any statistic on the test data, not even for normalization
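As a quick sanity check of feature-wise normalization, the sketch below applies the same recipe to small made-up arrays. The names `toy_train` and `toy_test` and their values are assumptions for illustration only, not the Boston data.

```python
import numpy as np

# Toy stand-ins for the real arrays (hypothetical values):
# three features on very different scales, as in the housing data.
rng = np.random.default_rng(0)
low, high = [0.0, 1.0, 0.0], [1.0, 12.0, 100.0]
toy_train = rng.uniform(low=low, high=high, size=(8, 3))
toy_test = rng.uniform(low=low, high=high, size=(4, 3))

# Per-feature statistics come from the training split only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

train_norm = (toy_train - mean) / std
test_norm = (toy_test - mean) / std  # reuse the training mean and std

print(train_norm.mean(axis=0))  # each entry is approximately 0
print(train_norm.std(axis=0))   # each entry is approximately 1
```

The normalized test columns will generally not have mean 0 and standard deviation 1 exactly, which is expected: they are scaled with the training statistics.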

### 3. Building the model

In [42]:
def build_model():
    model = models.Sequential()
    
    # Hidden layer 1 => 128 units
    model.add(layers.Dense(128, activation='relu', input_shape=(train_data.shape[1],)))
    
    # Hidden layer 2 => 64 units
    model.add(layers.Dense(64, activation='relu'))
    
    # Output layer with no activation (equivalent to identity activation => for regression problems)
    model.add(layers.Dense(1))
    
    # Compile the model (Note the loss function and metric)
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    
    # return the compiled model
    return model
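The model is compiled with `loss='mse'` (mean squared error) and the metric `'mae'` (mean absolute error). To show what these two quantities measure, here is a minimal NumPy sketch on made-up targets and predictions. The helper names `mse` and `mae` are hypothetical and are not Keras functions.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up prices in thousands of dollars (illustrative only).
y_true = np.array([20.0, 30.0, 10.0])
y_pred = np.array([22.0, 27.0, 10.0])

print(mse(y_true, y_pred))  # (4 + 9 + 0) / 3
print(mae(y_true, y_pred))  # (2 + 3 + 0) / 3
```

Because the targets are in thousands of dollars, an MAE of 0.5 would mean the predictions are off by \$500 on average, which makes MAE easier to interpret than MSE.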

### 4. Validating using K-fold validation
Because the dataset is so small, holding out a single validation set from the training data makes little sense: it would contain only about 100 observations, so the validation score could change a lot depending on which observations happened to land in it. The best practice in this situation is K-fold cross-validation: split the data into K partitions, train K identical models, each on K-1 partitions while evaluating on the remaining one, and average the K validation scores. Here we use 4-fold validation.
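To make the partitioning concrete, here is a minimal sketch that only computes the row ranges of each validation fold for 404 samples and 4 folds. The helper `fold_slices` is a hypothetical name introduced for illustration.

```python
def fold_slices(n_samples, k):
    """Return (start, stop) row ranges of the k validation folds.

    Uses integer division, so if n_samples is not divisible by k the
    trailing rows are never used for validation.
    """
    fold_size = n_samples // k
    return [(i * fold_size, (i + 1) * fold_size) for i in range(k)]

n_samples = 404
slices = fold_slices(n_samples, 4)
for i, (start, stop) in enumerate(slices):
    n_val = stop - start
    print("fold", i + 1,
          "| validation rows", start, "to", stop - 1,
          "| training rows:", n_samples - n_val)
```

Each fold validates on 101 rows and trains on the remaining 303.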

In [43]:
k=4 #no. of folds
num_val_sample=len(train_data)//k #size of each fold 
num_epochs=100
all_scores=[] #list of scores of each fold

In [44]:
for i in range(k):
    print("processing fold : ",i+1)
    
    #validation data: rows of partition i
    val_data=train_data[i*num_val_sample:(i+1)*num_val_sample]
    val_targets=train_targets[i*num_val_sample:(i+1)*num_val_sample]
    
    #training data: rows of all the other partitions
    partial_train_data=np.concatenate([train_data[:i*num_val_sample],train_data[(i+1)*num_val_sample:]],axis=0)
    partial_train_targets=np.concatenate([train_targets[:i*num_val_sample],train_targets[(i+1)*num_val_sample:]],axis=0)
    
    #build a fresh, compiled model for every fold
    model=build_model()
    
    #note: num_epochs is defined above but not passed here, so fit() runs
    #for the Keras default of 1 epoch; add epochs=num_epochs to train longer
    test_model=model.fit(partial_train_data,partial_train_targets,validation_data=(val_data,val_targets),batch_size=1,verbose=0)
    
    # Evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    
    all_scores.append(val_mae) #appending the score to list

processing fold :  1
processing fold :  2
processing fold :  3
processing fold :  4


In [45]:
all_scores

[3.083348321442557, 3.7941452182165465, 3.6494627683469565, 4.640469924058064]

In [46]:
np.mean(all_scores)

3.791856558016031

* The different folds do indeed show rather different validation scores, from 3.08 to 4.64. The average (3.79) is a much more reliable metric than any single score, which is the entire point of K-fold cross-validation. In this case, the predictions are off by about \$3,792 on average.
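One way to quantify how much the folds disagree is to report the standard deviation of the per-fold scores next to their mean. The sketch below reuses the four validation MAE values printed above; reporting the spread this way is a suggestion, not something the original run did.

```python
import numpy as np

# Per-fold validation MAE values copied from the output above.
all_scores = [3.083348321442557, 3.7941452182165465,
              3.6494627683469565, 4.640469924058064]

mean_mae = np.mean(all_scores)
spread = np.std(all_scores)

# Targets are in thousands of dollars, so multiply by 1000 for dollars.
print(f"mean validation MAE: {mean_mae:.2f} (about ${mean_mae * 1000:,.0f})")
print(f"standard deviation across folds: {spread:.2f}")
```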

#### Training the final model

* You can vary the number of epochs in the code above and watch how the validation score changes. For the final model below, we train for 40 epochs with a batch size of 10.

In [60]:
# Get a fresh, compiled model
model = build_model()

# Train it on the entire training set (no validation split this time)
# Using batch_size=10
model.fit(train_data, train_targets, epochs=40, batch_size=10, verbose=0)

<keras.callbacks.History at 0x16d2d6b6940>

In [61]:
# Evaluate on the test data
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)



In [62]:
# Final score
test_mae_score

2.585851042878394

* On the test set, we are off by about \$2,586 on average.