### Boston Housing Data

To better understand the metrics used in regression settings, we will look at the Boston Housing dataset.

First, use the cell below to read in the dataset and set up the training and testing data used for the rest of this problem.

In [1]:
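# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run as written.
from sklearn.datasets import load_boston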
from sklearn.model_selection import train_test_split
import numpy as np

boston = load_boston()
y = boston.target
X = boston.data

X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.33, random_state=42)
print(X_test)

[[9.17800e-02 0.00000e+00 4.05000e+00 ... 1.66000e+01 3.95500e+02
  9.04000e+00]
 [5.64400e-02 4.00000e+01 6.41000e+00 ... 1.76000e+01 3.96900e+02
  3.53000e+00]
 [1.05740e-01 0.00000e+00 2.77400e+01 ... 2.01000e+01 3.90110e+02
  1.80700e+01]
 ...
 [7.61620e-01 2.00000e+01 3.97000e+00 ... 1.30000e+01 3.92400e+02
  1.04500e+01]
 [1.00245e+00 0.00000e+00 8.14000e+00 ... 2.10000e+01 3.80230e+02
  1.19800e+01]
 [5.20140e-01 2.00000e+01 3.97000e+00 ... 1.30000e+01 3.86860e+02
  5.91000e+00]]
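To make the effect of `test_size=0.33` concrete, here is a minimal sketch that applies the same split to placeholder data. The arrays `X_demo` and `y_demo` are hypothetical random stand-ins with the same shape as the Boston features (506 rows, 13 columns), not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data with the Boston feature shape (506 x 13).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

# Same split settings as the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Roughly one third of the rows end up in the test set.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, so every model is later trained and evaluated on the same rows.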


> **Step 1:** Before we get too far, let's do a quick check of which models you can use here, given that this is a regression problem.
- linear regression - regression problems only.
- logistic regression - classification problems only.
- decision trees, random forest, adaptive boosting - both regression and classification.

> **Step 2:** Now, for each of the models from the previous step that can be used for regression problems, import it using sklearn.

In [2]:
# Import models from sklearn - notice you will want to use 
# the regressor version (not classifier) - googling to find 
# each of these is what we all do!
from sklearn.linear_model import LinearRegression
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor

> **Step 3:** Now that you have imported the 4 models that can be used for regression problems, instantiate each below.

In [3]:
# Instantiate each of the models you imported
# For now use the defaults for all the hyperparameters
linearReg = LinearRegression()
treeReg = tree.DecisionTreeRegressor()
RFReg = RandomForestRegressor()
AdaBReg = AdaBoostRegressor()

> **Step 4:** Fit each of your instantiated models on the training data.

In [4]:
# Fit each of your models using the training data
linearReg.fit(X_train,y_train)
treeReg.fit(X_train,y_train)
RFReg.fit(X_train,y_train)
AdaBReg.fit(X_train,y_train)

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                  n_estimators=50, random_state=None)

> **Step 5:** Use each of your models to predict on the test data.

In [5]:
# Predict on the test values for each model
linPred = linearReg.predict(X_test)
treePred = treeReg.predict(X_test)
RFPred = RFReg.predict(X_test)
AdaBPred = AdaBReg.predict(X_test)
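Steps 3 to 5 repeat the same instantiate, fit, predict pattern for each model, so they can also be written as a loop. The sketch below is one possible way to do that. It assumes synthetic data from `make_regression` instead of the Boston data, and the names `models` and `predictions` are illustrative. A fixed `random_state` is added to the randomized models so results are repeatable, which differs from the default settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the Boston dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# One entry per regressor; random_state is fixed only for repeatability.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(random_state=0),
    "adaboost": AdaBoostRegressor(random_state=0),
}

# Fit each model on the training data and predict on the test data.
predictions = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    predictions[name] = model.predict(X_te)
```

Keeping the predictions in a dictionary keyed by model name makes it easy to score every model with the same metric later.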

> **Step 6:** Recall which metrics are used for regression and which for classification:
- Regression: mean_squared_error, mean_absolute_error, r2_score
- Classification: precision, recall, accuracy, area under the curve

> Now that you have identified the metrics that can be used for regression problems, use sklearn to import them.

In [6]:
# Import the metrics from sklearn
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
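To see what these three metrics measure, the sketch below computes each one by hand with NumPy on a tiny made-up example and compares the results with sklearn. The arrays `y_true` and `y_pred` are illustrative values, not taken from the Boston data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.5])

errors = y_true - y_pred

# Mean absolute error: average size of the errors.
mae = np.mean(np.abs(errors))

# Mean squared error: average squared error, which penalizes large misses more.
mse = np.mean(errors ** 2)

# r2: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mean_absolute_error(y_true, y_pred))  # both 0.5
print(mse, mean_squared_error(y_true, y_pred))   # both 0.375
print(r2, r2_score(y_true, y_pred))              # both 0.925
```

Lower is better for the two error metrics, while a higher r2 score is better, with 1.0 meaning perfect predictions.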

> **Step 7:** Compute the r2 score for each model on the test data.

In [7]:
print('r2 score for linear regression model: {}'.format(r2_score(y_test, linPred)))
print('r2 score for tree regression model: {}'.format(r2_score(y_test, treePred)))
print('r2 score for RF regression model: {}'.format(r2_score(y_test, RFPred)))
print('r2 score for Adaboost regression model: {}'.format(r2_score(y_test, AdaBPred)))

r2 score for linear regression model: 0.7261570836552487
r2 score for tree regression model: 0.754078687806155
r2 score for RF regression model: 0.8653314778605486
r2 score for Adaboost regression model: 0.8032720092889251


> **Step 8:** Compute the mean squared error for each model on the test data.

In [8]:
print('mean_squared_error score for linear regression model: {}'.format(mean_squared_error(y_test, linPred)))
print('mean_squared_error score for tree regression model: {}'.format(mean_squared_error(y_test, treePred)))
print('mean_squared_error score for RF regression model: {}'.format(mean_squared_error(y_test, RFPred)))
print('mean_squared_error score for Adaboost regression model: {}'.format(mean_squared_error(y_test, AdaBPred)))

mean_squared_error score for linear regression model: 20.72402343733967
mean_squared_error score for tree regression model: 18.610958083832337
mean_squared_error score for RF regression model: 10.191512880239518
mean_squared_error score for Adaboost regression model: 14.888080892128585


> **Step 9:** Compute the mean absolute error for each model on the test data.

In [9]:
print('mean absolute error score for linear regression model: {}'.format(mean_absolute_error(y_test, linPred)))
print('mean absolute error score for tree regression model: {}'.format(mean_absolute_error(y_test, treePred)))
print('mean absolute error score for RF regression model: {}'.format(mean_absolute_error(y_test, RFPred)))
print('mean absolute error score for Adaboost regression model: {}'.format(mean_absolute_error(y_test, AdaBPred)))

mean absolute error score for linear regression model: 3.1482557548168115
mean absolute error score for tree regression model: 2.913173652694611
mean absolute error score for RF regression model: 2.135131736526945
mean absolute error score for Adaboost regression model: 2.7452007178596594


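> **Conclusion:** The random forest model performed best on all three metrics: it had the highest r2 score (about 0.87), the lowest mean squared error (about 10.19), and the lowest mean absolute error (about 2.14). Because the ensemble models were fit without a fixed `random_state`, the exact values can vary slightly between runs.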
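The comparison can be checked directly from the scores printed in Steps 7 to 9. The sketch below copies those reported values into a dictionary (the name `scores` and its layout are illustrative) and picks the best model per metric, remembering that a higher r2 is better while lower errors are better.

```python
# Scores copied from the printed output of Steps 7 to 9.
scores = {
    "linear regression": {
        "r2": 0.7261570836552487, "mse": 20.72402343733967, "mae": 3.1482557548168115},
    "decision tree": {
        "r2": 0.754078687806155, "mse": 18.610958083832337, "mae": 2.913173652694611},
    "random forest": {
        "r2": 0.8653314778605486, "mse": 10.191512880239518, "mae": 2.135131736526945},
    "adaboost": {
        "r2": 0.8032720092889251, "mse": 14.888080892128585, "mae": 2.7452007178596594},
}

# Higher is better for r2; lower is better for the error metrics.
best_r2 = max(scores, key=lambda m: scores[m]["r2"])
best_mse = min(scores, key=lambda m: scores[m]["mse"])
best_mae = min(scores, key=lambda m: scores[m]["mae"])

print(best_r2, best_mse, best_mae)  # random forest for all three
```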