### HW 4: Build and evaluate regression models

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing
    
    

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

In [20]:
from sklearn.metrics import mean_squared_error as mse, r2_score as r_2

# Follow this link to find more metrics for regression:
# https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics
def regression_metrics(y_train, y_train_pred, y_test, y_test_pred):
    '''report the mse and r2 scores for both the training and test sets
    '''
    print('MSE Train: ', round(mse(y_train, y_train_pred), 3))
    print('MSE Test: ', round(mse(y_test, y_test_pred), 3))
    print('R^2 Train: ', round(r_2(y_train, y_train_pred), 3))
    print('R^2 Test: ', round(r_2(y_test, y_test_pred), 3))
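
The helper above reports MSE and R^2 only. Since the tasks below ask for the RMS error, a variant that also returns RMSE may be convenient. This is a minimal sketch: the name `regression_metrics_with_rmse` and the dictionary layout are hypothetical and not part of the original notebook, and the input arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def regression_metrics_with_rmse(y_true, y_pred):
    """Return MSE, RMSE, and R^2 for one set of predictions (hypothetical helper)."""
    mse_value = mean_squared_error(y_true, y_pred)
    return {
        "mse": float(mse_value),
        "rmse": float(np.sqrt(mse_value)),  # RMSE is the square root of MSE
        "r2": float(r2_score(y_true, y_pred)),
    }


# Tiny illustrative arrays (not taken from the housing dataset)
example_scores = regression_metrics_with_rmse(
    np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0])
)
print({name: round(value, 3) for name, value in example_scores.items()})
```

RMSE is in the same units as the target, which makes it easier to compare against the target mean than MSE.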

#### Load the dataset

In [19]:
#Load the California housing dataset

ds = fetch_california_housing()
X = ds.data
y = ds.target
print('dataset size:', X.shape, y.shape)
print('Data in ds', dir(ds))
print(ds.DESCR)

dataset size: (20640, 8) (20640,)
Data in ds ['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house 

In [27]:
print("Features:\n", X)
print("Targets:\n", y)
print("X mean:\n", np.mean(X, axis = 0))
print("y mean:\n", np.mean(y))

Features:
 [[   8.3252       41.            6.98412698 ...    2.55555556
    37.88       -122.23      ]
 [   8.3014       21.            6.23813708 ...    2.10984183
    37.86       -122.22      ]
 [   7.2574       52.            8.28813559 ...    2.80225989
    37.85       -122.24      ]
 ...
 [   1.7          17.            5.20554273 ...    2.3256351
    39.43       -121.22      ]
 [   1.8672       18.            5.32951289 ...    2.12320917
    39.43       -121.32      ]
 [   2.3886       16.            5.25471698 ...    2.61698113
    39.37       -121.24      ]]
Targets:
 [4.526 3.585 3.521 ... 0.923 0.847 0.894]
X mean:
 [ 3.87067100e+00  2.86394864e+01  5.42899974e+00  1.09667515e+00
  1.42547674e+03  3.07065516e+00  3.56318614e+01 -1.19569704e+02]
y mean:
 2.068558169089147


In [6]:
# data split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
print('training set size:', X_train.shape[0])
print('test set size:', X_test.shape[0])

training set size: 14448
test set size: 6192


#### 1 Train and evaluate a linear regression model using OLS. 20 points.
    1) Train a linear regression model
    2) Evaluate the model and print out the RMS and $(R^2)$ values on the training and test sets
    3) What issues can you observe from the results? Any possible solutions?

In [30]:
# Note: the assignment asks for the "RMS" value. This is interpreted here as the
# root mean squared error (RMSE), which is reported alongside MSE, RSS, TSS, and R^2.

#OLS: ordinary least squares. 

from sklearn.linear_model import LinearRegression as LR

#1) train. 5 points

LR_model = LR()
LR_model.fit(X_train, y_train)

#2) evaluate. 5 points 

y_train_pred = LR_model.predict(X_train)
y_test_pred = LR_model.predict(X_test)

# calculate and print R^2 and root mean squared error (RMSE) with hand-written helpers;
# the results are cross-checked against regression_metrics at the end of this cell

def calc_MSE(y_actual, y_pred):
    return np.mean((y_actual - y_pred) ** 2)

def calc_RMSE(y_actual, y_pred):
    return np.sqrt(calc_MSE(y_actual, y_pred))

def calc_RSS(y_actual, y_pred):
    RSS = np.sum((y_actual - y_pred) ** 2)
    return RSS

def calc_TSS(y_actual, y_pred):
    TSS = np.sum((y_actual - np.mean(y_actual)) ** 2)
    return TSS

def calc_R_Squared(y_actual, y_pred):
    RSS = calc_RSS(y_actual, y_pred)
    TSS = calc_TSS(y_actual, y_pred)
    return 1 - (RSS / TSS)

print("Training Set:")
print("MSE:", calc_MSE(y_train, y_train_pred))
print("RMSE:", calc_RMSE(y_train, y_train_pred))
print("RSS:", calc_RSS(y_train, y_train_pred))
print("TSS:", calc_TSS(y_train, y_train_pred))
print("R^2:", calc_R_Squared(y_train, y_train_pred))

print()

print("Test Set:")
print("MSE:", calc_MSE(y_test, y_test_pred))
print("RMSE:", calc_RMSE(y_test, y_test_pred))
print("RSS:", calc_RSS(y_test, y_test_pred))
print("TSS:", calc_TSS(y_test, y_test_pred))
print("R^2:", calc_R_Squared(y_test, y_test_pred))

# evaluating linear regression with regression metrics function from above

print()
print("Regression metrics function output:")
regression_metrics(y_train, y_train_pred, y_test, y_test_pred)


Training Set:
MSE: 0.5173003362697665
RMSE: 0.7192359392228439
RSS: 7473.955258425586
TSS: 19227.7912639894
R^2: 0.6112941337977223

Test Set:
MSE: 0.5431489670037247
RMSE: 0.7369864089681198
RSS: 3363.1784036870636
TSS: 8255.40224389771
R^2: 0.592608778551877

Regression metrics function output:
MSE Train:  0.517
MSE Test:  0.543
R^2 Train:  0.611
R^2 Test:  0.593
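
As a consistency check, the hand-computed training-set values can be related to one another using the printed RSS and TSS and the training set size of 14448:

$$
\mathrm{MSE} = \frac{\mathrm{RSS}}{n} = \frac{7473.955}{14448} \approx 0.5173, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} \approx 0.7192, \qquad
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{7473.955}{19227.791} \approx 0.6113
$$

These agree with the `regression_metrics` output after rounding to three decimals.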


##### 3) Question: What issues can you observe from the results? Any possible solutions? 10 points

Response:

The error seems relatively high to me: the test RMSE is about 0.74 (MSE 0.543), while the mean of the targets is 2.0686. The R^2 could likewise be improved, since the model explains only about 59% of the variance on the test set. The test R^2 (0.593) is slightly lower than the training R^2 (0.611), which suggests slight overfitting, although the gap is small. I think future models could improve both the MSE and the R^2.

To improve the MSE and R^2 of the model, there are several things I could try. I could test different combinations of features and keep only the most relevant ones. Applying a nonlinear transformation to the target variable might also help if the true relationship is nonlinear. I could also preprocess the data and check whether removing outliers helps.

If removing the outliers does not address the slight overfitting, I could add regularization to the linear regression model. My hunch is that ridge regularization would give the best result of all the regularization techniques, because all of the features currently in the model seem relevant to predicting the median house values for California districts, and ridge shrinks coefficients without dropping features entirely. I won't know until I try it and compare the results.
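
A comparison along these lines could be set up as follows. This is a minimal sketch on synthetic data from `make_regression` rather than the California housing data; the `alpha` values, the variable names, and the scaling pipeline are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 8 features, like the housing dataset)
X_syn, y_syn = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# Scale features first so the ridge penalty treats all coefficients comparably
candidates = {"OLS": make_pipeline(StandardScaler(), LinearRegression())}
for alpha in (0.1, 1.0, 10.0):  # illustrative values only
    candidates[f"Ridge(alpha={alpha})"] = make_pipeline(StandardScaler(), Ridge(alpha=alpha))

ridge_comparison = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # Pipeline.score returns R^2 for regressors
    ridge_comparison[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    train_r2, test_r2 = ridge_comparison[name]
    print(f"{name}: R^2 train = {train_r2:.3f}, R^2 test = {test_r2:.3f}")
```

In the notebook, the same loop could be run on `X_train` and `X_test`. Whether ridge actually helps there would have to be checked empirically.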

#### 2 Train and evaluate the decision tree approach? 50 points.

    1) Train a decision tree model. Please tune the arguments, 'criterion', 'max_depth', and 'min_samples_leaf', to achieve good performance
    2) Print out the depth of the tree, the number of leaves, and the importance of each feature
    3) Evaluate the model and print out the RMS and $R^2$ values on the training and test sets
    4) Show the tree using tree.export_text() in sklearn
    5) Print out the decision path for data sample X_test[0]
    6) Test different max_depth values and analyse the results?

In [101]:
# Decision tree

from sklearn.tree import DecisionTreeRegressor as DTR

#1) train. 10 points

# The parameters set here seem to be 
# the best ones of the ones tried when tuning:

set_criterion = "squared_error" # possible criterions: 'friedman_mse', 'poisson', 'squared_error', 'absolute_error'
set_max_depth = 20
set_min_samples_leaf = 20

DTR_model = DTR(criterion = set_criterion, max_depth = set_max_depth, min_samples_leaf = set_min_samples_leaf, random_state = 0)
DTR_model.fit(X_train, y_train)

#2) print out tree attributes: the depth of the tree, 
#the number of leaves, and the importance of each feature. 10 points

print("Depth of the tree:\n", DTR_model.get_depth())
print("Number of leaves:\n", DTR_model.get_n_leaves())
print("Feature importances:\n", DTR_model.feature_importances_)

#3) evaluate. 5 points

y_train_pred = DTR_model.predict(X_train)
y_test_pred = DTR_model.predict(X_test)

print()

print("Training Set:")
print("MSE:", calc_MSE(y_train, y_train_pred))
print("RMSE:", calc_RMSE(y_train, y_train_pred))
print("RSS:", calc_RSS(y_train, y_train_pred))
print("TSS:", calc_TSS(y_train, y_train_pred))
print("R^2:", calc_R_Squared(y_train, y_train_pred))

print()

print("Test Set:")
print("MSE:", calc_MSE(y_test, y_test_pred))
print("RMSE:", calc_RMSE(y_test, y_test_pred))
print("RSS:", calc_RSS(y_test, y_test_pred))
print("TSS:", calc_TSS(y_test, y_test_pred))
print("R^2:", calc_R_Squared(y_test, y_test_pred))

# evaluating decision tree regression with regression metrics function from above

print()
print("Regression metrics function output:")
regression_metrics(y_train, y_train_pred, y_test, y_test_pred)


Depth of the tree:
 19
Number of leaves:
 556
Feature importances:
 [0.63064807 0.03804683 0.03416392 0.0072     0.0055091  0.14080278
 0.06930243 0.07432687]

Training Set:
MSE: 0.2575971384662756
RMSE: 0.5075402826045196
RSS: 3721.7634565607495
TSS: 19227.7912639894
R^2: 0.8064383264066829

Test Set:
MSE: 0.38723119722852467
RMSE: 0.622279034861793
RSS: 2397.7355732390247
TSS: 8255.40224389771
R^2: 0.7095555731386195

Regression metrics function output:
MSE Train:  0.258
MSE Test:  0.387
R^2 Train:  0.806
R^2 Test:  0.71
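
The feature importances above are printed as a bare array. Pairing them with the feature names makes them easier to read. The sketch below hard-codes the values from the printed output and assumes the columns follow the order in the dataset description (the order of `ds.feature_names`); in the notebook, `ds.feature_names` and `DTR_model.feature_importances_` could be used directly.

```python
import numpy as np

# Feature names in the column order listed in the dataset description
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]
# Importances copied from the printed output above
importances = np.array([0.63064807, 0.03804683, 0.03416392, 0.0072,
                        0.0055091, 0.14080278, 0.06930243, 0.07432687])

# Sort from most to least important
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:<12}{value:.3f}")
```

On these numbers, MedInc accounts for roughly 63% of the total importance, followed by AveOccup at about 14%.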


In [99]:
#4) explore the trained decision tree using export_text(). 5 points

from sklearn import tree
from sklearn.tree import export_text

DTR_model_text = export_text(DTR_model, feature_names = ds.feature_names)
print(type(DTR_model_text))
print("Decision Tree:\n", DTR_model_text)


<class 'str'>
Decision Tree:
 |--- MedInc <= 5.03
|   |--- MedInc <= 3.07
|   |   |--- AveRooms <= 4.22
|   |   |   |--- AveOccup <= 2.50
|   |   |   |   |--- MedInc <= 2.19
|   |   |   |   |   |--- AveRooms <= 3.33
|   |   |   |   |   |   |--- Population <= 1208.00
|   |   |   |   |   |   |   |--- Latitude <= 37.51
|   |   |   |   |   |   |   |   |--- Latitude <= 34.01
|   |   |   |   |   |   |   |   |   |--- value: [1.63]
|   |   |   |   |   |   |   |   |--- Latitude >  34.01
|   |   |   |   |   |   |   |   |   |--- value: [2.29]
|   |   |   |   |   |   |   |--- Latitude >  37.51
|   |   |   |   |   |   |   |   |--- value: [1.36]
|   |   |   |   |   |   |--- Population >  1208.00
|   |   |   |   |   |   |   |--- value: [2.62]
|   |   |   |   |   |--- AveRooms >  3.33
|   |   |   |   |   |   |--- AveOccup <= 1.97
|   |   |   |   |   |   |   |--- Longitude <= -118.19
|   |   |   |   |   |   |   |   |--- Latitude <= 37.95
|   |   |   |   |   |   |   |   |   |--- value: [2.58]
|   |   | 

In [100]:
#5) print out the decision path for the first test data sample X_test[0]. 5 points

print(X_test[0:1])
first_test_data_sample_decision_path = DTR_model.decision_path([X_test[0]])
print("Decision path for sample X_test[0]:")
print(first_test_data_sample_decision_path)


[[ 4.15180000e+00  2.20000000e+01  5.66307278e+00  1.07547170e+00
   1.55100000e+03  4.18059299e+00  3.25800000e+01 -1.17050000e+02]]
Decision path for sample X_test[0]:
  (0, 0)	1
  (0, 1)	1
  (0, 421)	1
  (0, 519)	1
  (0, 703)	1
  (0, 779)	1
  (0, 785)	1
  (0, 813)	1
  (0, 865)	1
  (0, 871)	1
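
The sparse matrix printed above lists the ids of the nodes that the sample visits, but not the tests applied at each node. The sketch below decodes a path into readable rules using the fitted tree's `tree_.feature` and `tree_.threshold` arrays. It is an illustration on synthetic data with a small tree; the names `demo_tree`, `sample`, and `path_lines` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic problem so the path is short (not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

sample = X_demo[0:1]
node_indicator = demo_tree.decision_path(sample)
# Column indices of the non-zero entries in row 0 are the visited node ids
node_ids = node_indicator.indices[node_indicator.indptr[0]:node_indicator.indptr[1]]
leaf_id = demo_tree.apply(sample)[0]

path_lines = []
for node_id in node_ids:
    if node_id == leaf_id:
        leaf_value = demo_tree.tree_.value[node_id][0][0]
        path_lines.append(f"node {node_id}: leaf, predicted value {leaf_value:.3f}")
        continue
    feature = demo_tree.tree_.feature[node_id]
    threshold = demo_tree.tree_.threshold[node_id]
    sign = "<=" if sample[0, feature] <= threshold else ">"
    path_lines.append(
        f"node {node_id}: feature[{feature}] = {sample[0, feature]:.3f} {sign} {threshold:.3f}"
    )
print("\n".join(path_lines))
```

In the notebook, the same loop could be applied to `DTR_model` and `X_test[0:1]`, with `ds.feature_names[feature]` in place of the feature index.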


##### 6) Question: Test different max_depth values (5, 8, 10, 20) and analyse the results? 15 points.
 - SET criterion = 'squared_error' and min_samples_leaf = 20
 - Compare the results with different max_depth values
 - Summarize the main disadvantages of decision trees

##### Results at different max_depth values:

At max_depth = 5: 
- MSE Train: 0.491
- MSE Test: 0.538
- R^2 Train: 0.631
- R^2 Test: 0.596

At max_depth = 8:
- MSE Train: 0.343
- MSE Test: 0.434
- R^2 Train: 0.743
- R^2 Test: 0.675

At max_depth = 10:
- MSE Train: 0.29
- MSE Test: 0.4
- R^2 Train: 0.782
- R^2 Test: 0.7

At max_depth = 20:
- MSE Train: 0.258
- MSE Test: 0.387
- R^2 Train: 0.806
- R^2 Test: 0.71

##### Response: 

The decision tree's training and test MSE and R^2 both improved as max_depth increased with all other hyperparameters held fixed. The gains shrank quickly, though: going from max_depth = 10 to 20 raised the test R^2 only from 0.70 to 0.71, and the fitted tree reached a depth of just 19, so the depth limit was no longer the binding constraint. Of the values I tested, the best hyperparameters were max_depth = 20, criterion = 'squared_error', and min_samples_leaf = 20. The improvement came with more overfitting: the gap between training and test R^2 grew from 0.035 at max_depth = 5 to 0.096 at max_depth = 20.

During hyperparameter tuning, I also noticed that setting min_samples_leaf lower reduced the degree of overfitting, but this was accompanied by a drop in R^2 on the test set.

#### 3. Random forests. 30 points
    1) What are the differences between bagging and random forests?
    2) Train a random forest model. Please tune the arguments n_estimators, max_features, and max_depth to achieve good performance.
    3) Can your random forest model achieve better performance than the decision tree? Please summarize the advantages of random forests.

##### 1) What is the main problem of the bagging approach, and how can random forests address it? 5 points

Response: 

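The main problem with bagging is that all of the features are considered when splitting each node, so the resulting trees tend to be similar to one another, especially when a few features are strong predictors. Averaging highly correlated trees reduces variance less than averaging more independent ones, which limits the benefit of the ensemble and can leave it prone to overfitting the training data. Random forests improve on this by considering only a random subset of the features at each split. This increases the variability between the trees and leads to a more effective model.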
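
The similarity between trees can be measured directly. The sketch below fits two forests on synthetic data, one that considers all features at every split (which behaves like bagged trees) and one that uses `max_features="sqrt"`, and reports the average pairwise correlation between the individual trees' test predictions. The helper `mean_tree_correlation` and the data are assumptions for illustration; one would typically expect the lower correlation for the `"sqrt"` setting, but this should be checked rather than assumed.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data with 10 features (not the housing data)
X_syn, y_syn = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)


def mean_tree_correlation(forest, X):
    """Average pairwise correlation between the individual trees' predictions."""
    per_tree = np.array([tree.predict(X) for tree in forest.estimators_])
    corr = np.corrcoef(per_tree)
    upper = np.triu_indices_from(corr, k=1)  # each pair of trees counted once
    return float(corr[upper].mean())


# max_features=None considers every feature at each split (bagging-like behaviour)
settings = {"bagging-like (all features)": None, "random forest (sqrt)": "sqrt"}
tree_correlation = {}
for label, max_features in settings.items():
    forest = RandomForestRegressor(n_estimators=50, max_features=max_features, random_state=0)
    forest.fit(X_tr, y_tr)
    tree_correlation[label] = mean_tree_correlation(forest, X_te)
    print(f"{label}: mean tree correlation = {tree_correlation[label]:.3f}, "
          f"test R^2 = {forest.score(X_te, y_te):.3f}")
```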

In [103]:
# Random forests

from sklearn.ensemble import RandomForestRegressor as RFR

# 2) train. 10 points

set_n_estimators = 100
set_max_features = "sqrt"
set_max_depth = 20
RFR_model = RFR(n_estimators = set_n_estimators, max_features = set_max_features, max_depth = set_max_depth, random_state = 0)
RFR_model.fit(X_train, y_train)

#3) evaluate. 5 points. 10 extra points for R^2 > 0.8

y_train_pred = RFR_model.predict(X_train)
y_test_pred = RFR_model.predict(X_test)

print()

print("Training Set:")
print("MSE:", calc_MSE(y_train, y_train_pred))
print("RMSE:", calc_RMSE(y_train, y_train_pred))
print("RSS:", calc_RSS(y_train, y_train_pred))
print("TSS:", calc_TSS(y_train, y_train_pred))
print("R^2:", calc_R_Squared(y_train, y_train_pred))

print()

print("Test Set:")
print("MSE:", calc_MSE(y_test, y_test_pred))
print("RMSE:", calc_RMSE(y_test, y_test_pred))
print("RSS:", calc_RSS(y_test, y_test_pred))
print("TSS:", calc_TSS(y_test, y_test_pred))
print("R^2:", calc_R_Squared(y_test, y_test_pred))

# evaluating random forest regression with regression metrics function from above

print()
print("Regression metrics function output:")
regression_metrics(y_train, y_train_pred, y_test, y_test_pred)


Training Set:
MSE: 0.038747197256630984
RMSE: 0.1968430777462875
RSS: 559.8195059638044
TSS: 19227.7912639894
R^2: 0.9708848770887035

Test Set:
MSE: 0.25146485044561057
RMSE: 0.5014627109223682
RSS: 1557.0703539592205
TSS: 8255.40224389771
R^2: 0.8113877061399174

Regression metrics function output:
MSE Train:  0.039
MSE Test:  0.251
R^2 Train:  0.971
R^2 Test:  0.811


##### 3) Can your random forest model achieve better performance than the decision tree? Please summarize the advantages of random forests. 10 points

##### Response: 

My random forest model performs better than the best decision tree I created. The test R^2 for the random forest is 0.811, whereas the best decision tree has a test R^2 of 0.71, an improvement of 0.101. The test MSE also dropped from 0.387 to 0.251. The random forest does fit the training set much more closely (training R^2 of 0.971), so the gap between training and test scores is larger than for the single tree.

Like decision trees, random forests can perform both regression and classification tasks. A forest is harder to inspect than a single tree, since each prediction is an average over many trees, but it still reports feature importances that show which inputs drive the model. Random forests also scale to large datasets because the trees can be trained independently of one another. They generally predict better than a single decision tree because bagging and the random feature selection at each split reduce the variance of the ensemble.