# **Boston Housing Prices Demo**

By Saurabh Ghanekar from Nimblebox Inc.

## **Introduction**

In this demo, we will build machine learning models that predict housing prices in Boston from the features described in the well-known Boston Housing Price dataset.

## **Installing Dependencies**

Most of the packages we will use come pre-installed on the Nimblebox platform. Still, make sure the following packages are installed in your instance. Note that `load_boston` was removed in scikit-learn 1.2, so the loading step below needs an older scikit-learn release.

* TensorFlow 2.x
* Pandas
* Matplotlib
* Numpy
* Seaborn
* Scikit-learn

## **Loading the Dataset**
In this dataset, each row describes a Boston town or suburb. There are 506 rows and 13 attributes (features) with a target column (price). We will use pandas and scikit-learn to load and explore the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston

boston_dataset = load_boston()

The `DESCR` attribute holds a description of the dataset, including the meaning of each column name, so printing it gives us the dataset in a nutshell.

In [None]:
print(boston_dataset.DESCR)

To perform exploratory data analysis, we convert the dataset into a pandas DataFrame and print its first rows.

In [None]:
dataset = pd.DataFrame(boston_dataset.data)
dataset.columns = boston_dataset.feature_names

print("Original Dataframe")
print("[INFO] dataset type: {}".format(type(dataset)))
print("[INFO] dataset shape: {}".format(dataset.shape))

# Insert the target column in our main dataframe
dataset["PRICE"] = boston_dataset.target

print("\nDataframe with target column")
print("[INFO] dataset type: {}".format(type(dataset)))
print("[INFO] dataset shape: {}".format(dataset.shape))

print("\n",dataset.head())

## **Analysis of Dataset**

### **Statistical Summary**

Now that our dataframe is ready, we will use the `describe()` function to get a statistical summary of the dataset. It shows the count, mean, standard deviation, min, quartiles, and max for each column.

In [None]:
print(dataset.describe())

### **Correlation**

Finding correlations between attributes is a useful way to check for patterns in the dataset. It helps us find statistical relationships between different attributes. A correlation coefficient always falls within the range [-1, 1].

* 1 - Positively correlated
* -1 - Negatively correlated
* 0 - Not correlated

We will use the `corr()` function to compute the correlation between attributes and the `heatmap()` function to visualize the correlation matrix.

In [None]:
# correlation between attributes

print(dataset.corr())
sns.heatmap(dataset.corr())
plt.savefig("heatmap_correlation.png")
plt.show()
plt.clf()
plt.close()

The pandas `corr()` function supports several correlation methods: Pearson, Kendall, and Spearman. The default method is Pearson correlation.
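To make the [-1, 1] range concrete, here is a minimal plain-Python sketch of the Pearson coefficient. The helper `pearson` and the toy values are made up for illustration and are not taken from the Boston dataset. In pandas itself, another method can be selected with `dataset.corr(method="spearman")`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the 1/n factors cancel in the ratio)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values, chosen so the relationships are perfectly linear
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price_up = [10.0, 15.0, 20.0, 25.0, 30.0]    # rises with rooms
price_down = [30.0, 25.0, 20.0, 15.0, 10.0]  # falls with rooms

r_up = pearson(rooms, price_up)      # perfectly positive relationship
r_down = pearson(rooms, price_down)  # perfectly negative relationship
print(r_up, r_down)
```

Real attributes rarely reach either extreme; values near 0 indicate little linear relationship.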

### **Missing Values**

Sometimes a dataset has missing values, such as `NaN` or an empty string in a cell. We need to take care of these missing values so that our machine learning model doesn’t break. Three common approaches are:

* Replace the missing value with a large negative number (e.g. -999).
* Replace the missing value with the mean of the column.
* Replace the missing value with the median of the column.

To find if a column in our dataset has missing values, you can use `pd.isnull(df).any()` which returns a boolean for each column in the dataset that tells if the column contains any missing value.

In [None]:
print(pd.isnull(dataset).any())

Turns out, our Boston Housing Prices Dataset doesn’t have any missing values.
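Had there been gaps, the mean and median strategies listed above could be sketched as follows. This is a plain-Python illustration on a made-up column; the helper name `impute` is hypothetical. With pandas, the same idea is usually written with `fillna`, for example `dataset.fillna(dataset.mean())`.

```python
import math
import statistics


def impute(values, strategy="mean"):
    """Return a copy of values with NaN entries replaced by the mean or median."""
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return [fill if math.isnan(v) else v for v in values]


# Hypothetical column with one missing entry
column = [1.0, 2.0, float("nan"), 3.0, 10.0]

mean_filled = impute(column, "mean")      # gap filled with the mean of 1, 2, 3, 10
median_filled = impute(column, "median")  # gap filled with the median of 1, 2, 3, 10
print(mean_filled)
print(median_filled)
```

The median is less sensitive to extreme values such as the 10.0 above, which is why it is often preferred for skewed columns.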

## **Visualizing the Dataset**

Visualization of data allows trends and patterns to be more easily seen. We will use Box Plot, Density Plot, and Scatter Plot to visualize our dataset.

In [None]:
# visualize the dataset
import random
import os

sns.set(color_codes=True)
colors = ["y", "b", "g", "r"]

cols = list(dataset.columns.values)

### **Box Plot**

A box-and-whisker plot is a univariate plot used to visualize a data distribution.

* The central line in the box is the median of the entire data distribution.
* The edges of the box are the upper and lower quartiles, i.e. the medians of the values above and below the central median, respectively.
* The whiskers extend to the largest and smallest values that are not treated as outliers; points beyond the whiskers are drawn individually.
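The quantities a box plot draws can be computed by hand. The sketch below uses the standard library on made-up values and assumes the common convention that whiskers stop at the last point within 1.5 × IQR of the box, which is the usual plotting default but may be configured differently.

```python
import statistics

# Hypothetical values with one obvious outlier
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles via linear interpolation between data points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Assumed whisker rule: 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the value 30 falls above the upper fence, so it would be drawn as an individual point rather than stretching the whisker.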

In [None]:
# Box Plot
os.makedirs("plots/box_plot", exist_ok=True)

# draw a boxplot with vertical orientation
for i, col in enumerate(cols):
    # passing the column as y draws the box vertically
    sns.boxplot(y=dataset[col], color=random.choice(colors))
    plt.savefig("plots/box_plot/box_plot_" + str(i) + ".png")
    plt.show()
    plt.clf()
    plt.close()

The box plots show that several attributes in our dataset contain outliers.

### **Density Plot**

A density plot is a univariate plot that draws a histogram of the data distribution and fits a Kernel Density Estimate (KDE) to it.
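As a rough illustration of what a KDE computes, the sketch below places a Gaussian bump on every sample and averages them. The helper `gaussian_kde`, the samples, and the hand-picked bandwidth are all assumptions for illustration; plotting libraries choose the bandwidth automatically.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


# Hypothetical samples forming two clusters
samples = [1.0, 1.5, 2.0, 5.0, 5.5]
density = gaussian_kde(samples, bandwidth=0.5)

# A Riemann sum over a wide grid approximates the area under the curve
step = 0.01
area = sum(density(-5 + i * step) * step for i in range(1601))

print(density(1.5), density(3.5), area)
```

The estimate is high near the clusters and low in the gap between them, and the total area stays close to 1, as expected of a density.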

In [None]:
# Density Plot
os.makedirs("plots/density_plot", exist_ok=True)

# draw a histogram and fit a kernel density estimate (KDE)
for i, col in enumerate(cols):
    if col == "CHAS":
        continue  # CHAS is binary data, so a KDE cannot fit it
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(dataset[col], kde=True, stat="density", color=random.choice(colors))
    plt.savefig("plots/density_plot/density_plot_" + str(i) + ".png")
    plt.show()
    plt.clf()
    plt.close()

From the density plots, we can see that:

* `CRIM`, `AGE`, `B`, and `ZN` have an exponential distribution.
* `NOX`, `RM`, and `LSTAT` have a skewed Gaussian distribution.
* `RAD` and `TAX` have a bimodal distribution.

### **Scatter Plot**

A scatter plot is used to understand the relationship between two attributes in the dataset. Below we plot `PRICE` (the target) against each of the other attributes.

In [None]:
if not os.path.exists("plots/scatter_plot"):
    os.makedirs("plots/scatter_plot")

# bivariate plot between target and different attributes
for i, col in enumerate(cols[:-1]):  # the last column is PRICE itself, so skip it
    sns.jointplot(x=col, y="PRICE", data=dataset)
    plt.savefig("plots/scatter_plot/PRICE_vs_" + str(i) + ".png")
    plt.show()
    plt.clf()
    plt.close()

## **Data Preprocessing**

We will first split our data into training and test sets using the `train_test_split` method from scikit-learn. For this demo, we will use a split of 70% of the data for training and 30% for testing. We also set a `random_state` seed, in order to allow reproducibility.

In [None]:
from sklearn.model_selection import train_test_split

X = dataset.drop("PRICE", axis=1)
y = dataset["PRICE"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

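We will standardize our features so that each one has zero mean and unit variance, which gives our machine learning models inputs on a comparable scale. The mean and standard deviation are computed on the training set only and then applied to both sets, so no information from the test set leaks into training.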

In [None]:
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train = (X_train - mean) / std
X_test = (X_test - mean) / std
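The same transformation on a single made-up feature, written with the standard library, shows what the cell above does. The toy values are hypothetical; `statistics.stdev` is the sample standard deviation (dividing by n - 1), which matches the default of pandas `std()`.

```python
import statistics

# Hypothetical values of one feature
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Statistics come from the training values only
mean = statistics.mean(train)
std = statistics.stdev(train)  # sample standard deviation (n - 1)

train_scaled = [(v - mean) / std for v in train]
test_scaled = [(v - mean) / std for v in test]  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have zero mean and unit standard deviation. The test values generally do not, and that is expected.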

## **Building our Model**

In this demo, we are going to build three machine learning models to predict housing prices in Boston.

### **Linear Regression Model**

First, let's try to predict housing prices with the Linear Regression algorithm and see how our model performs.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)

print('Mean squared error on test data: ', mse_lr)
print('Mean absolute error on test data: ', mae_lr)
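For reference, the two metrics are simple averages over the prediction errors. The sketch below recomputes them by hand on made-up numbers; the helper names `mse` and `mae` are hypothetical stand-ins for the scikit-learn functions used above.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions
y_true_toy = [20.0, 25.0, 30.0]
y_pred_toy = [22.0, 24.0, 27.0]

mse_toy = mse(y_true_toy, y_pred_toy)  # (4 + 1 + 9) / 3
mae_toy = mae(y_true_toy, y_pred_toy)  # (2 + 1 + 3) / 3
print(mse_toy, mae_toy)
```

Because the errors are squared, MSE penalizes large misses more heavily than MAE, while MAE stays in the same units as the target.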

Let’s see the major features that have an impact on the model output.

In [None]:
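# The features are standardized, so the coefficient magnitude reflects impact;
# take absolute values so that large negative coefficients also rank as important
feature_importance = np.abs(lr_model.coef_)
feature_importance = 100.0 * (feature_importance / feature_importance.max())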

sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, dataset.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.savefig("plots/feature_importance_lr.png")
plt.show()
plt.clf()
plt.close()

### **Decision Tree**

Now let’s try our prediction using the Decision Tree algorithm and see how our model performs.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

tree = DecisionTreeRegressor(random_state=1)  # fixed seed for reproducible results
tree.fit(X_train, y_train)

y_pred_tree = tree.predict(X_test)

mse_dt = mean_squared_error(y_test, y_pred_tree)
mae_dt = mean_absolute_error(y_test, y_pred_tree)

print('Mean squared error on test data: ', mse_dt)
print('Mean absolute error on test data: ', mae_dt)

Let’s see the major features that have an impact on the model output.

In [None]:
feature_importance = tree.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())

sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, dataset.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.savefig("plots/feature_importance_dt.png")  # separate file so the linear regression plot is not overwritten
plt.show()
plt.clf()
plt.close()

### **Artificial Neural Network**

We will create a neural network to predict the housing prices and see how our model performs.

#### **Building the model**

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(128, input_shape=(13, ), activation='relu', name='dense_1'))
model.add(Dense(64, activation='relu', name='dense_2'))
model.add(Dense(1, activation='linear', name='dense_output'))

model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.summary()
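The parameter counts that `model.summary()` reports can be checked by hand: each dense layer has one weight per input-output pair plus one bias per output unit. A short sketch of that arithmetic for the layer sizes used above:

```python
# Input width followed by the number of units in each Dense layer
layer_sizes = [13, 128, 64, 1]

# weights (n_in * n_out) plus biases (n_out) for each layer
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)
```

Most of the parameters sit in the second layer, because it connects the two widest layers.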

#### **Training and Evaluating our Model**

We will fit the model on the training features and their labels for 100 epochs, holding out 5% of the training samples (18 records) as a validation set.

In [None]:
history = model.fit(X_train, y_train, epochs=100, validation_split=0.05)
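The record counts follow from simple arithmetic. The sketch below assumes that the 30% test share is rounded up and that the share kept for fitting is rounded down; these rounding rules are assumptions about the libraries' behavior, not something stated in this demo.

```python
import math

n_samples = 506  # rows in the dataset

# Assumption: the test share is rounded up
n_test = math.ceil(0.3 * n_samples)
n_train = n_samples - n_test

# Assumption: the share kept for fitting is rounded down,
# and the remainder becomes the validation set
n_fit = math.floor(n_train * (1 - 0.05))
n_val = n_train - n_fit

print(n_train, n_test, n_fit, n_val)
```

Under these assumptions, the network is fitted on 336 records and validated on 18 after every epoch.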

Now that we have successfully trained our model, let’s see how our model does.

In [None]:
mse_nn, mae_nn = model.evaluate(X_test, y_test)

print('Mean squared error on test data: ', mse_nn)
print('Mean absolute error on test data: ', mae_nn)

## **Congratulations!!**

We successfully built machine learning models to predict housing prices in Boston using the Boston Housing Price dataset and three different techniques: Linear Regression, a Decision Tree, and a Neural Network.