### Boston Housing Data

To better understand the metrics used in regression settings, we will work with the Boston Housing dataset.

First, run the cell below to read in the dataset and create the training and testing data used for the rest of this problem.

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np
import tests2 as t

boston = load_boston()
y = boston.target
X = boston.data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
# When can you use the model - use each option as many times as necessary
a = 'regression'
b = 'classification'
c = 'both regression and classification'

models = {
    'decision trees':c, # Letter here,
    'random forest':c, # Letter here,
    'adaptive boosting':c, # Letter here,
    'logistic regression':b, # Letter here,
    'linear regression':a # Letter here
}

#checks your answer, no need to change this code
t.q1_check(models)

That's right! All but logistic regression can be used for predicting numeric values, and linear regression is the only one of these that you should not use for predicting categories. Technically, sklearn will let you do almost anything you want, but you will usually want to match each model to the problem type the way you did in this question!


In [None]:
# Import models from sklearn - notice you will want to use
# the regressor version (not classifier) - googling to find
# each of these is what we all do!
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor

In [None]:
# Instantiate each of the models you imported
# For now use the defaults for all the hyperparameters
reg_mod = LinearRegression()
tree_mod = DecisionTreeRegressor()
rf_mod = RandomForestRegressor()
ada_mod = AdaBoostRegressor()

In [None]:
# Fit each of your models using the training data
reg_mod.fit(X_train, y_train)
tree_mod.fit(X_train, y_train)
rf_mod.fit(X_train, y_train)
ada_mod.fit(X_train, y_train)

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=None)

In [None]:
# Predict on the test values for each model
reg_preds = reg_mod.predict(X_test)
tree_preds = tree_mod.predict(X_test)
rf_preds = rf_mod.predict(X_test)
ada_preds = ada_mod.predict(X_test)

In [None]:
# potential model options
a = 'regression'
b = 'classification'
c = 'both regression and classification'

# Match each metric to the type of problem it is used for
metrics = {
    'precision':b, # Letter here,
    'recall':b, # Letter here,
    'accuracy':b, # Letter here,
    'r2_score':a, # Letter here,
    'mean_squared_error':a, # Letter here,
    'area_under_curve':b, # Letter here,
    'mean_absolute_area':a # Letter here
}

#checks your answer, no need to change this code
t.q6_check(metrics)

That's right! Looks like you know your metrics!


In [None]:
# Import the metrics from sklearn
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [None]:
def r2(actual, preds):
    '''
    INPUT:
    actual - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    OUTPUT:
    returns the r-squared score as a float
    '''
    sse = np.sum((actual-preds)**2)
    sst = np.sum((actual-np.mean(actual))**2)
    return 1 - sse/sst

# Check solution matches sklearn
print(r2(y_test, tree_preds))
print(r2_score(y_test, tree_preds))
print("Since the above match, we can see that we have correctly calculated the r2 value.")

0.736729850468
0.736729850468
Since the above match, we can see that we have correctly calculated the r2 value.
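To build some intuition for what this score means, here is a small dependency-free sketch. The toy values and the helper name `r2_plain` are made up for illustration and are not part of the original notebook. It mirrors the `1 - sse/sst` formula above and shows three reference points: perfect predictions, always predicting the mean, and predictions that do worse than the mean.

```python
# Toy target values, invented purely for illustration.
actual = [2.0, 4.0, 6.0, 8.0]


def r2_plain(actual, preds):
    """R-squared as 1 - SSE/SST, mirroring r2() above without numpy."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, preds))   # error of the model
    sst = sum((a - mean_actual) ** 2 for a in actual)        # error of the mean
    return 1 - sse / sst


perfect = r2_plain(actual, actual)                  # every prediction is exact
baseline = r2_plain(actual, [5.0, 5.0, 5.0, 5.0])   # always predict the mean
worse = r2_plain(actual, [8.0, 6.0, 4.0, 2.0])      # predictions in reverse order

print(perfect, baseline, worse)
```

Under these assumptions, a score of 1 means no error at all, 0 means the model does no better than predicting the mean of the test targets, and a negative score means it does worse than that.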


In [None]:
def mse(actual, preds):
    '''
    INPUT:
    actual - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    OUTPUT:
    returns the mean squared error as a float
    '''

    return np.mean((actual - preds)**2)


# Check your solution matches sklearn
print(mse(y_test, tree_preds))
print(mean_squared_error(y_test, tree_preds))
print("If the above match, you are all set!")

19.9238922156
19.9238922156
If the above match, you are all set!
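It can help to see how this metric relates to the r2 score. The sketch below uses made-up toy values (not the Boston data) and plain Python to check that r2 equals one minus the MSE divided by the variance of the actual values. Because that variance is fixed for a given test set, ranking models by lowest MSE or by highest r2 on the same test set should give the same order.

```python
# Toy values, invented purely for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
preds = [2.0, 6.0, 7.0, 10.0]

n = len(actual)
mean_actual = sum(actual) / n

sse = sum((a - p) ** 2 for a, p in zip(actual, preds))   # sum of squared errors
sst = sum((a - mean_actual) ** 2 for a in actual)        # total sum of squares

mse_value = sse / n          # mean squared error
variance = sst / n           # population variance of the actual values

r2_direct = 1 - sse / sst
r2_from_mse = 1 - mse_value / variance

print(mse_value, r2_direct, r2_from_mse)
```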


In [None]:
def mae(actual, preds):
    '''
    INPUT:
    actual - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    OUTPUT:
    returns the mean absolute error as a float
    '''

    return np.mean(np.abs(actual - preds))

# Check your solution matches sklearn
print(mae(y_test, tree_preds))
print(mean_absolute_error(y_test, tree_preds))
print("If the above match, you are all set!")

3.1119760479
3.1119760479
If the above match, you are all set!
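One practical difference between the two error metrics is how they treat large misses. The following sketch uses invented toy values and hypothetical helper names (`mse_plain`, `mae_plain`) to compare two sets of predictions with the same total absolute error: one spreads the error evenly, the other concentrates it in a single point.

```python
# Toy values, invented purely for illustration.
actual = [10.0, 12.0, 14.0, 16.0]
small_errors = [11.0, 13.0, 15.0, 17.0]   # every prediction is off by 1
one_big_miss = [10.0, 12.0, 14.0, 20.0]   # only the last prediction is off, by 4


def mse_plain(actual, preds):
    """Mean squared error without numpy."""
    return sum((a - p) ** 2 for a, p in zip(actual, preds)) / len(actual)


def mae_plain(actual, preds):
    """Mean absolute error without numpy."""
    return sum(abs(a - p) for a, p in zip(actual, preds)) / len(actual)


print('small errors :', mse_plain(actual, small_errors), mae_plain(actual, small_errors))
print('one big miss :', mse_plain(actual, one_big_miss), mae_plain(actual, one_big_miss))
```

In this toy case both prediction sets have the same mean absolute error, but the mean squared error is four times larger when the error sits in one point, because squaring penalizes large misses more heavily.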


In [None]:
#match each metric to the model that performed best on it
a = 'decision tree'
b = 'random forest'
c = 'adaptive boosting'
d = 'linear regression'


best_fit = {
    'mse':b, # letter here,
    'r2':b, # letter here,
    'mae':b # letter here (random forest is best on all three metrics in the results printed below)
}

#Tests your answer - don't change this code
t.check_ten(best_fit)


In [None]:
# cells for work

In [None]:
def print_metrics(y_true, preds, model_name=None):
    '''
    INPUT:
    y_true - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    model_name - (optional) string label to include in the printed output
    OUTPUT:
    None - prints the mean squared error, mean absolute error, and r2 score
    '''
    if model_name is None:
        print('Mean Square Error: ', mean_squared_error(y_true, preds))
        print('Mean Absolute Error: ', mean_absolute_error(y_true, preds))
        print('R2 Score: ', r2_score(y_true, preds))
        print('\n\n')
    else:
        print('Mean Square Error for ', model_name, ': ', mean_squared_error(y_true, preds))
        print('Mean Absolute Error for ', model_name, ': ', mean_absolute_error(y_true, preds))
        print('R2 Score: ', model_name, ': ', r2_score(y_true, preds))
        print('\n\n')

In [None]:
print_metrics(y_test, tree_preds, 'tree')
print_metrics(y_test, rf_preds, 'Random Forest')
print_metrics(y_test, ada_preds, 'Adaboost')
print_metrics(y_test, reg_preds, 'Linear Regression')

Mean Square Error for  tree :  19.923892215568863
Mean Absolute Error for  tree :  3.111976047904192
R2 Score:  tree :  0.7367298504681555



Mean Square Error for  Random Forest :  10.616735329341317
Mean Absolute Error for  Random Forest :  2.2094610778443116
R2 Score:  Random Forest :  0.8597126772492982



Mean Square Error for  Adaboost :  14.300630055624195
Mean Absolute Error for  Adaboost :  2.745461844844545
R2 Score:  Adaboost :  0.8110344619209597



Mean Square Error for  Linear Regression :  20.747143360309067
Mean Absolute Error for  Linear Regression :  3.1512878365884154
R2 Score:  Linear Regression :  0.7258515818230032
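
Reading the best model off the printed output by eye is easy to get wrong, so here is a small sketch that does the comparison programmatically. The `results` dictionary simply copies the values printed above. Those values come from one run with unseeded models, so treat them as illustrative rather than exactly reproducible. Lower is better for the two error metrics and higher is better for the r2 score.

```python
# Values copied from the printed output above (one run; the tree-based models were not seeded).
results = {
    'tree': {'mse': 19.923892215568863, 'mae': 3.111976047904192, 'r2': 0.7367298504681555},
    'Random Forest': {'mse': 10.616735329341317, 'mae': 2.2094610778443116, 'r2': 0.8597126772492982},
    'Adaboost': {'mse': 14.300630055624195, 'mae': 2.745461844844545, 'r2': 0.8110344619209597},
    'Linear Regression': {'mse': 20.747143360309067, 'mae': 3.1512878365884154, 'r2': 0.7258515818230032},
}

# Errors: the smallest value wins. R2: the largest value wins.
best_mse = min(results, key=lambda name: results[name]['mse'])
best_mae = min(results, key=lambda name: results[name]['mae'])
best_r2 = max(results, key=lambda name: results[name]['r2'])

print('Best by MSE:', best_mse)
print('Best by MAE:', best_mae)
print('Best by R2 :', best_r2)
```

For this run, the same model comes out on top for all three metrics.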



