***
<b> Author:</b> Raghavendra Tapas
    
<b> Updated on:</b> May 2021
    
<b> Context:</b> Boston House Price Prediction. <b>Source:</b> __[databricks](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2217267761988785/1705666210822011/2819141955166097/latest.html)__

Feel free to reach out to me on __[Twitter](https://twitter.com/raghutapas12)__ for any corrections or additional updates!
***


# Boston House Price Prediction!

This notebook demonstrates a regression problem: predicting Boston house prices with scikit-learn.

<b>Steps:</b>

1. Pre-process the data.
2. Choose the right estimator/algorithm for the problem.
3. Fit the model/algorithm and use it to make predictions on our data.
4. Evaluate the model.
5. Improve the model.
6. Save and load a trained model.
7. Build post-processing dashboards.
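
The steps above can be sketched end to end as follows. This is a minimal sketch, not the notebook's own pipeline: it assumes synthetic data from `make_regression` as a stand-in for the Boston dataset, and it uses `pickle` for step 6, which is one common option among several. Steps 5 and 7 are left out.

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the notebook uses the Boston dataset)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=1)

# Step 1: split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Steps 2 and 3: choose an estimator, fit it, and make predictions
model = Ridge()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

# Step 4: evaluate with R^2 and mean absolute error
r2 = model.score(X_test, y_test)
mae = mean_absolute_error(y_test, y_preds)
print(f"R^2: {r2:.3f}, MAE: {mae:.3f}")

# Step 6: save the trained model to bytes and load it back
saved_bytes = pickle.dumps(model)
loaded_model = pickle.loads(saved_bytes)
```

The reloaded model should give the same predictions as the original, which is a quick way to check that saving worked.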

## Dependencies or Libraries used

Import NumPy, pandas, and scikit-learn.

In [1]:
import numpy as np
import pandas as pd

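# Import the Boston House Price dataset from scikit-learn.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version.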
from sklearn.datasets import load_boston
boston = load_boston()

## Exploratory Phase
 - Understanding the given dataset.

Link: https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-dataset

Number of instances: 506

Number of attributes: 13 numeric/categorical predictive features

Number of missing values: none

Creator: Harrison, D. and Rubinfeld, D.L.

Attributes:

- CRIM per capita crime rate by town

- ZN proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS proportion of non-retail business acres per town

- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- NOX nitric oxides concentration (parts per 10 million)

- RM average number of rooms per dwelling

- AGE proportion of owner-occupied units built prior to 1940

- DIS weighted distances to five Boston employment centres

- RAD index of accessibility to radial highways

- TAX full-value property-tax rate per \$10,000

- PTRATIO pupil-teacher ratio by town

- B 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town

- LSTAT % lower status of the population

- MEDV Median value of owner-occupied homes in \$1000’s

In [2]:
# Getting data set into Pandas Data Frame

boston_df = pd.DataFrame(boston["data"], columns = boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])
boston_df.head(10)

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 | 28.7 |
| 6 | 0.08829 | 12.5 | 7.87 | 0.0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5.0 | 311.0 | 15.2 | 395.6 | 12.43 | 22.9 |
| 7 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5.0 | 311.0 | 15.2 | 396.9 | 19.15 | 27.1 |
| 8 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5.0 | 311.0 | 15.2 | 386.63 | 29.93 | 16.5 |
| 9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5.0 | 311.0 | 15.2 | 386.71 | 17.1 | 18.9 |


In [3]:
# How many samples?
len(boston_df)

506

Map of choosing the right model: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

- Are we predicting a category? `No.` We are predicting a price.

- Are we predicting a quantity? `Yes.` We are predicting the price of a Boston house.

- Do we have fewer than 100k samples? `Yes.` We have 506.

- Are only a few features important? `No, we don't know for sure.`

`Solution:` Try a ridge regression model.
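
The questions above follow the regression branch of the scikit-learn estimator map. A minimal sketch of that branch as a function is shown below. `suggest_regressors` is a hypothetical helper written for illustration, not part of scikit-learn, and it encodes a simplified reading of the map.

```python
def suggest_regressors(n_samples, few_features_important):
    """Suggest estimators for a regression task (simplified reading of the map)."""
    if n_samples < 50:
        # The map recommends collecting more data before modelling
        return ["get more data"]
    if n_samples >= 100_000:
        # Large datasets: stochastic gradient descent scales better
        return ["SGDRegressor"]
    if few_features_important:
        # Sparse solutions when only a few features should matter
        return ["Lasso", "ElasticNet"]
    # Otherwise start with ridge regression or a linear SVR
    return ["Ridge", "SVR(kernel='linear')"]


# The Boston dataset: 506 samples, feature importance unknown
print(suggest_regressors(n_samples=506, few_features_important=False))
# → ['Ridge', "SVR(kernel='linear')"]
```

If the first suggestion does not work well, the map points to ensemble regressors next, which is the route this notebook takes with a random forest.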

## Ridge Regression Model

In [4]:
# Import ridge regression model

from sklearn.linear_model import Ridge

# Setup random seed
np.random.seed(1)

# Create the data
x = boston_df.drop("target", axis = 1)
y = boston_df["target"]

# Split into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

# Instantiate Model
model = Ridge()
model.fit(x_train, y_train)

# Check evaluation score (for regressors, score() returns R^2, not classification accuracy)
accuracy1 = model.score(x_test, y_test)
print(f"Accuracy of the Ridge Regression is {accuracy1 * 100:.2f} %")

Accuracy of the Ridge Regression is 76.56 %


## Random Forest Regression Model

In [5]:
# Import Random Forest Regression model

from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(1)

# Create the data
x = boston_df.drop("target", axis = 1)
y = boston_df["target"]

# Split into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

# Instantiate Model
rf = RandomForestRegressor()
rf.fit(x_train, y_train)

# Check evaluation score (for regressors, score() returns R^2, not classification accuracy)
accuracy2 = rf.score(x_test, y_test)
print(f"Accuracy of the Random Forest Regression is {accuracy2 * 100:.2f} %")


Accuracy of the Random Forest Regression is 91.25 %


In [6]:
# prediction array

y_preds = rf.predict(x_test)
y_preds[:10]

array([29.926, 27.022, 20.343, 20.593, 19.64 , 19.731, 28.09 , 18.868,
       20.441, 23.622])

In [7]:
# truth array

np.array(y_test[:10])

array([28.2, 23.9, 16.6, 22. , 20.8, 23. , 27.9, 14.5, 21.5, 22.6])

In [8]:
# Mean absolute error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

2.2971568627451

## Comparing the R^2 scores of both machine learning models!

In [9]:
print(f"Accuracy of the Ridge Regression is {accuracy1 * 100:.2f} %")
print(f"Accuracy of the Random Forest Regression is {accuracy2 * 100:.2f} %")

Accuracy of the Ridge Regression is 76.56 %
Accuracy of the Random Forest Regression is 91.25 %


## Regression model evaluation metrics

Model evaluation metrics documentation - https://scikit-learn.org/stable/modules/model_evaluation.html

1. R^2 (pronounced r-squared) or coefficient of determination.
2. Mean absolute error (MAE)
3. Mean squared error (MSE)

### **R^2 score**

What R-squared does: it compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
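
To make the comparison with the mean concrete, here is a small sketch that computes R^2 by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2_by_hand` and the numbers are made up for illustration.

```python
def r2_by_hand(y_true, y_pred):
    """Compute R^2 as 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the targets
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]  # made-up targets with mean 6.0

print(r2_by_hand(y_true, [6.0, 6.0, 6.0, 6.0]))  # predicts the mean → 0.0
print(r2_by_hand(y_true, y_true))                # perfect predictions → 1.0
print(r2_by_hand(y_true, [9.0, 7.0, 5.0, 3.0]))  # worse than the mean → -3.0
```

The last case shows why the score has no lower bound: a model that does worse than predicting the mean gets a negative R^2.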

In [15]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Importing the Boston House Price dataset from the scikit library
from sklearn.datasets import load_boston
boston = load_boston()

# Getting data set into Pandas Data Frame
boston_df = pd.DataFrame(boston["data"], columns = boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])

# Create the feature matrix and label
x = boston_df.drop("target", axis=1)
y = boston_df["target"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

model = RandomForestRegressor(n_estimators=100)
model.fit(x_train, y_train);

In [16]:
# Return the coefficient of determination (R^2) of the predictions
model.score(x_test, y_test)

0.8654448653350507

In [29]:
# import r2 score
from sklearn.metrics import r2_score

# Fill an array with y_test mean
y_test_mean = np.full(len(y_test), y_test.mean())

In [20]:
y_test.mean()

21.488235294117654

In [21]:
# Model only predicting the mean gets an R^2 score of 0 (up to floating-point rounding)
r2_score(y_test, y_test_mean)

2.220446049250313e-16

In [22]:
# Model predicting perfectly the correct values gets an R^2 score of 1
r2_score(y_test, y_test)

1.0

### **Mean absolute error (MAE)**

MAE is the average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your model's predictions are.

In [24]:
# Mean absolute error
from sklearn.metrics import mean_absolute_error

y_preds = model.predict(x_test)  # x_test (lowercase) matches the variable defined in the split above
mae = mean_absolute_error(y_test, y_preds)
mae

2.136382352941176

In [25]:
df = pd.DataFrame(data={"actual values": y_test,
                        "predicted values": y_preds})
df["differences"] = df["predicted values"] - df["actual values"]
df

| | actual values | predicted values | differences |
|---|---|---|---|
| 173 | 23.6 | 23.081 | -0.519 |
| 274 | 32.4 | 30.574 | -1.826 |
| 491 | 13.6 | 16.759 | 3.159 |
| 72 | 22.8 | 23.460 | 0.660 |
| 452 | 16.1 | 16.893 | 0.793 |
| ... | ... | ... | ... |
| 412 | 17.9 | 13.159 | -4.741 |
| 436 | 9.6 | 12.476 | 2.876 |
| 411 | 17.2 | 13.612 | -3.588 |
| 86 | 22.5 | 20.205 | -2.295 |
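
As a sanity check, MAE can be reproduced by hand from the first five rows shown above. This sketch uses only those five rows, so the result differs from the MAE over the full test set.

```python
# First five rows of the table above
actual = [23.6, 32.4, 13.6, 22.8, 16.1]
predicted = [23.081, 30.574, 16.759, 23.460, 16.893]

# Same sign convention as the "differences" column: predicted - actual
differences = [p - a for a, p in zip(actual, predicted)]

# MAE is the mean of the absolute differences
mae_five_rows = sum(abs(d) for d in differences) / len(differences)
print(round(mae_five_rows, 4))  # → 1.3914
```

Taking the absolute value first matters: without it, positive and negative differences would cancel out and understate the error.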


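### **Mean squared error (MSE)**

MSE is the average of the squared differences between predictions and actual values. Because each difference is squared, MSE is always non-negative and is expressed in squared units of the target.
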
In [27]:
# Mean squared error
from sklearn.metrics import mean_squared_error

y_preds = model.predict(x_test)  # x_test (lowercase) matches the variable defined in the split above
mse = mean_squared_error(y_test, y_preds)
mse

9.867437068627442

In [28]:
# Calculate MSE by hand
squared = np.square(df["differences"])
squared.mean()

9.867437068627439
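
The hand calculation above squares every difference before averaging, so a single large miss affects MSE far more than it affects MAE. A small sketch with made-up error values illustrates this; the helper names `mae_of` and `mse_of` are hypothetical.

```python
def mae_of(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def mse_of(errors):
    """Mean of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)


# Two made-up sets of prediction errors with the same total absolute error
even_errors = [2.0, 2.0, 2.0, 2.0]   # every prediction is off by 2
spiky_errors = [0.0, 0.0, 0.0, 8.0]  # one prediction is off by 8

print(mae_of(even_errors), mae_of(spiky_errors))  # → 2.0 2.0
print(mse_of(even_errors), mse_of(spiky_errors))  # → 4.0 16.0
```

Both sets look equally good under MAE, while MSE flags the set with the single large error as four times worse.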