# Predicting Boston Housing Prices
The aim of this project is to evaluate the performance and predictive power of a model trained on the Boston house prices dataset.

## Loading dataset
This cell imports the required Python libraries and loads the dataset. Once loading succeeds, it prints the number of data points and variables.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Helper module used for visualisations
import visuals as vs

%matplotlib inline

data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)

print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))

Boston housing dataset has 489 data points with 4 variables each.


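## Defining a performance metric
Here, we'll use the coefficient of determination (R²) to evaluate the performance of the model. A score of 1 means the predictions match the targets exactly, while a score of 0 means the model does no better than always predicting the mean of the targets.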

In [None]:
from sklearn.metrics import r2_score

In [None]:
def performance_metric(y_true, y_predict):
  """Return the R^2 score between the true and predicted target values."""
  score = r2_score(y_true, y_predict)

  return score
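
To make the metric concrete, here is a minimal sketch that computes R² by hand as 1 - SS_res / SS_tot. The helper name `r2_by_hand` and the five sample values are made up for illustration and are not part of the housing data. For a single target column this should agree with `r2_score`.

```python
def r2_by_hand(y_true, y_predict):
    """Compute R^2 = 1 - SS_res / SS_tot using plain Python."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_predict))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (hypothetical, not taken from housing.csv)
y_true = [3, -0.5, 2, 7, 4.2]
y_predict = [2.5, 0.0, 2.1, 7.8, 5.3]

score = r2_by_hand(y_true, y_predict)
print("R^2 = {:.3f}".format(score))  # prints "R^2 = 0.923"

# Predicting the mean for every point gives the baseline score of 0
mean_only = [sum(y_true) / len(y_true)] * len(y_true)
baseline = r2_by_hand(y_true, mean_only)
print("Baseline R^2 = {:.3f}".format(baseline))
```

A negative score is also possible: it means the predictions are worse than the mean-only baseline.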

## Splitting the data
Here the data is split into a training set (80%) and a testing set (20%) using `train_test_split` from `sklearn.model_selection`.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, prices, train_size = 0.8, random_state = 24)

print("Training and testing split successful.")

Training and testing split successful.
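
As a quick sanity check on what the split produces, the sketch below applies the same call to placeholder arrays with 489 rows and 3 feature columns. The array contents are dummies chosen for illustration, not the housing data. It shows the resulting set sizes and confirms that features and targets stay aligned row by row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's row count (489) and 3 feature columns.
# Row i of X is [3i, 3i+1, 3i+2] and y[i] is i, so alignment is easy to check.
X = np.arange(489 * 3).reshape(489, 3)
y = np.arange(489)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=24
)

print(X_train.shape, X_test.shape)  # prints "(391, 3) (98, 3)"

# Each training row still sits next to its own target value
rows_aligned = bool((X_train[:, 0] // 3 == y_train).all())
print("Rows aligned:", rows_aligned)
```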


## Fitting the model
Now that the data set is ready, we can fit the model. Here, a **decision tree algorithm** will be used to train the model. To ensure that we are producing an optimized model, we will train it using the grid search technique to optimize the `'max_depth'` parameter of the decision tree.

In addition, our implementation uses `ShuffleSplit()` as an alternative form of cross-validation. The `ShuffleSplit()` implementation below will create 10 (`'n_splits'`) shuffled sets, and for each shuffle, 20% (`'test_size'`) of the data will be used as the *validation set*.
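
The following sketch illustrates what `ShuffleSplit()` generates with these settings. The 50-row index array is a hypothetical stand-in for the training data, chosen only to keep the numbers small.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Stand-in for 50 training rows (hypothetical size, for illustration only)
indices = np.arange(50)
cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

split_sizes = []
for train_idx, val_idx in cv_sets.split(indices):
    # Within one split, no row is in both the training and validation parts
    assert set(train_idx).isdisjoint(val_idx)
    split_sizes.append((len(train_idx), len(val_idx)))

print(len(split_sizes), split_sizes[0])  # prints "10 (40, 10)"
```

Unlike k-fold cross-validation, each shuffle is drawn independently, so the same row can land in the validation set of more than one split.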

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

In [None]:
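def fit_model(X, y):
  """Grid-search 'max_depth' for a decision tree regressor and return the best model."""
  # 10 shuffled splits, each holding out 20% of the data for validation
  cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
  regressor = DecisionTreeRegressor(random_state=24)
  # Candidate depths 1 through 9 (range excludes its upper bound)
  params = {'max_depth' : range(1,10)}
  # Wrap the R^2 metric so that GridSearchCV can use it for scoring
  scoring_fnc = make_scorer(performance_metric)
  grid = GridSearchCV(regressor, params, scoring=scoring_fnc, cv=cv_sets)
  grid_fit = grid.fit(X, y)

  return grid_fit.best_estimator_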
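
To see what the grid search compares internally, the sketch below runs the same setup on synthetic data and prints the mean validation score for each candidate depth. The noisy sine curve is an assumption made purely for illustration, and `r2_score` is passed to `make_scorer` directly so that the example is self-contained.

```python
import numpy as np
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=24),
    {"max_depth": range(1, 10)},
    scoring=make_scorer(r2_score),
    cv=cv_sets,
)
grid.fit(X, y)

# One mean validation score per candidate depth, averaged over the 10 splits
depths = grid.cv_results_["param_max_depth"]
mean_scores = grid.cv_results_["mean_test_score"]
for depth, mean_score in zip(depths, mean_scores):
    print("max_depth={}: mean R^2 = {:.3f}".format(depth, mean_score))

print("Best depth:", grid.best_params_["max_depth"])
```

The depth with the highest mean validation score is the one kept in `best_estimator_`, which is then refit on all of the data passed to `fit`.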

In [None]:
reg = fit_model(X_train, y_train)

print("Parameter 'max_depth' is {} for the optimal model".format(reg.get_params()['max_depth']))

Parameter 'max_depth' is 4 for the optimal model


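## Predicting the selling prices
Now that the model has been trained, it can be used to predict selling prices for new clients' homes. Each row of `client_data` holds the feature values for one home, in the same order as the columns of `features`.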

In [None]:
client_data = [[5, 17, 15],
               [4, 32, 22], 
               [8, 3, 12]] 

# Show predictions
for i, price in enumerate(reg.predict(client_data)):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price))

Predicted selling price for Client 1's home: $412,324.14
Predicted selling price for Client 2's home: $234,546.67
Predicted selling price for Client 3's home: $914,025.00


In [None]:
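# Store the predicted prices (named so that the `prices` target Series is not overwritten)
predicted_prices = []

for k in range(10):
  # Note: each trial re-splits the previous trial's training subset with the
  # same seed, so the training set shrinks by about 20% on every trial
  X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=6)

  reg = fit_model(X_train, y_train)

  # Predict the price of the first client's home with this trial's model
  pred = reg.predict([client_data[0]])[0]
  predicted_prices.append(pred)

  print("Trial {}: ${:,.2f}".format(k+1, pred))

# Display price range
print("\nRange in prices: ${:,.2f}".format(max(predicted_prices) - min(predicted_prices)))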

Trial 1: $374,780.00
Trial 2: $395,581.40
Trial 3: $374,675.00
Trial 4: $360,640.00
Trial 5: $374,700.00
Trial 6: $422,730.00
Trial 7: $485,321.05
Trial 8: $450,030.00
Trial 9: $394,800.00
Trial 10: $625,800.00

Range in prices: $265,160.00
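
To put the range in context, the sketch below summarises the ten trial predictions printed above using only the standard library. The variable names are illustrative, and the values are copied from the trial output.

```python
from statistics import mean, stdev

# Trial predictions copied from the output above
trial_prices = [374780.00, 395581.40, 374675.00, 360640.00, 374700.00,
                422730.00, 485321.05, 450030.00, 394800.00, 625800.00]

price_range = max(trial_prices) - min(trial_prices)
avg_price = mean(trial_prices)
spread = stdev(trial_prices)  # sample standard deviation

print("Range: ${:,.2f}".format(price_range))
print("Mean: ${:,.2f}".format(avg_price))
print("Std dev: ${:,.2f}".format(spread))
print("Range as share of mean: {:.1%}".format(price_range / avg_price))
```

A range that is a large fraction of the mean suggests the prediction for this client is sensitive to which rows the model was trained on.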
