# HW 4: Build and evaluate regression models
### Dan Blanchette
### Python for Machine Learning
### Due 4-24-2023 

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing
    
    

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

In [None]:
from sklearn.metrics import mean_squared_error as mse, r2_score as r_2

# Follow this link to find more metrics for regression:
# https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics
def regression_metrics(y_train, y_train_pred, y_test, y_test_pred):
    '''report the mse and r2 scores for both the training and test sets
    '''
    print('MSE Train: ', round(mse(y_train, y_train_pred), 3))
    print('MSE Test: ', round(mse(y_test, y_test_pred), 3))
    print('R^2 Train: ', round(r_2(y_train, y_train_pred), 3))
    print('R^2 Test: ', round(r_2(y_test, y_test_pred), 3))

#### Load the dataset

In [None]:
#Load the California housing dataset
ds = fetch_california_housing()
X = ds.data
y = ds.target
print('dataset size:', X.shape, y.shape)
print('Data in ds', dir(ds))
print(ds.DESCR)

dataset size: (20640, 8) (20640,)
Data in ds ['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house 

In [None]:
# data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
print('training set size:', X_train.shape[0])
print('test set size:', X_test.shape[0])

training set size: 14448
test set size: 6192


#### 1 Train and evaluate a linear regression model using OLS. 20 points.
    1) Train a linear regression model
    2) Evaluate the model and print out the RMS and $R^2$ values on the training and test sets
    3) What issues can you observe from the results? Any possible solutions?

In [None]:
#OLS: ordinary least squares. 
from sklearn.linear_model import LinearRegression as LR

# create the OLS Model
slr = LR()

#1) train. 5 points
slr.fit(X_train, y_train)

#2) evaluate. 5 points 
y_train_pred = slr.predict(X_train)
y_test_pred = slr.predict(X_test)
regression_metrics(y_train, y_train_pred, y_test, y_test_pred)


MSE Train:  0.517
MSE Test:  0.543
R^2 Train:  0.611
R^2 Test:  0.593


### 3) Question: What issues can you observe from the results? Any possible solutions? 10 points

### Response: 
### MSE Values:
##### I observed that the Mean Squared Error values were relatively high for both the training set (0.517) and the test set (0.543). Taking the square root gives an RMSE of roughly 0.72 and 0.74 in the units of the target. Based on these results, the loss is not low enough to consider the model accurate. For the model to be considered accurate, the MSE should be close to 0.
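The assignment asks for RMS values, while `regression_metrics` reports the MSE. A minimal sketch of the conversion is shown here, using the MSE values printed above. The dollar conversion assumes the target is expressed in units of $100,000, as stated in the scikit-learn description of this dataset; the constant name `TARGET_UNIT_USD` is only illustrative.

```python
import numpy as np

# MSE values copied from the OLS output above.
mse_train, mse_test = 0.517, 0.543

# Assumption: the target is in units of $100,000 (per the scikit-learn
# description of the California housing dataset).
TARGET_UNIT_USD = 100_000

# RMSE is the square root of the MSE and has the same units as the target.
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

print('RMSE Train:', round(rmse_train, 3), '~', int(round(rmse_train * TARGET_UNIT_USD)), 'USD')
print('RMSE Test: ', round(rmse_test, 3), '~', int(round(rmse_test * TARGET_UNIT_USD)), 'USD')
```

Because the RMSE is in target units, it is easier to judge than the MSE when deciding whether the error is acceptable.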

### R^2 Values:
##### The R^2 values of the training set (0.611) and the test set (0.593) are very close to each other, so the model performs about the same on unseen data as on the data it was trained on. To judge the model we consider how high these values are. R^2 is the fraction of the variance in the target that the model explains, so the higher the value, the closer the predictions are to the observed data (goodness of fit). An R^2 value of 95% or higher would be ideal. Neither the training nor the test value comes close to that, which agrees with the MSE results: the model cannot be considered accurate.

### How to Improve the Model:

We could alternatively choose to use a ridge linear regression model instead.
OLS doesn't evaluate which independent variable is more important than the others. This approach finds the "best" unbiased coefficients from the data set. However, the trade-off is that the coefficient estimates can have high variance.

As we discussed in lecture, perhaps a ridge regression approach may be better. This model reduces variance but adds bias to the model. If we used this approach, there is a chance that the MSE and R^2 values may show improvement.

For OLS on a data set with missing entries, we could also add code that handles null values by replacing them with the mean, median, or mode of the affected column. The dataset description above reports no missing attribute values, so this step would not change the results here.
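The ridge idea can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data from `make_regression` as a stand-in for the housing data, standardizes the features in a pipeline (ridge penalizes all coefficients equally, so feature scale matters), and picks `alpha=10.0` arbitrarily. Whether ridge actually improves the scores on the housing data would have to be checked by running it there.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, like the housing set.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Same preprocessing for both models so only the penalty differs.
ols = make_pipeline(StandardScaler(), LinearRegression())
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))  # alpha is illustrative
ols.fit(Xd_train, yd_train)
ridge.fit(Xd_train, yd_train)

# The L2 penalty shrinks the coefficient vector toward zero.
ols_norm = np.linalg.norm(ols.named_steps['linearregression'].coef_)
ridge_norm = np.linalg.norm(ridge.named_steps['ridge'].coef_)

print('OLS   coef norm:', round(ols_norm, 3), ' R^2 test:', round(ols.score(Xd_test, yd_test), 3))
print('Ridge coef norm:', round(ridge_norm, 3), ' R^2 test:', round(ridge.score(Xd_test, yd_test), 3))
```

The shrinkage is the bias that ridge adds; the hoped-for payoff is lower variance of the coefficient estimates.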

#### 2 Train and evaluate the decision tree approach. 50 points.

    1) Train a decision tree model. Please tune the arguments, 'criterion', 'max_depth', and 'min_samples_leaf', to achieve good performance
    2) Print out the depth of the tree, the number of leaves, and the importance of each feature
    3) Evaluate the model and print out the RMS and $R^2$ values on the training and test sets
    4) Show the tree using tree.export_text() in sklearn
    5) Print out the decision path for data sample X_test[0]
    6) Test different max_depth values and analyse the results

In [None]:
# Decision tree
from sklearn.tree import DecisionTreeRegressor as DTR

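#1) train. 10 points
# Only the split criterion is set here; max_depth and min_samples_leaf keep
# their defaults, so the tree keeps splitting until its leaves are pure.
dtr = DTR(criterion = "squared_error")
dtr.fit(X_train, y_train)

# predictions used for the evaluation in step 3
y_train_pred = dtr.predict(X_train)
y_test_pred = dtr.predict(X_test)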

#2) print out tree attributes: the depth of the tree, 
#the number of leaves, and the importance of each feature. 10 points
print('Tree Attributes:')

print("Tree Max Depth: ", dtr.tree_.max_depth)

print("Tree Number of Leaves: ", dtr.tree_.n_leaves)

print("Feature Importances:", dtr.feature_importances_)


#3) evaluate. 5 points
print("\n\nEvaluation:")
regression_metrics(y_train, y_train_pred, y_test, y_test_pred)


Tree Attributes:
Tree Max Depth:  37
Tree Number of Leaves:  13883
Feature Importances: [0.52846844 0.05154746 0.05059777 0.02649499 0.02872664 0.13715165
 0.08855332 0.08845972]


Evaluation:
MSE Train:  0.0
MSE Test:  0.544
R^2 Train:  1.0
R^2 Test:  0.592


In [None]:
#4) explore the trained decision tree using export_text(). 5 points
from sklearn import tree
print('depth', dtr.get_depth())
# get_n_leaves is a method and must be called; without () the bound method is printed
print('subregions', dtr.get_n_leaves())

t = tree.export_text(dtr)
print(t)

depth 37
subregions 13883
|--- feature_0 <= 5.03
|   |--- feature_0 <= 3.07
|   |   |--- feature_2 <= 4.22
|   |   |   |--- feature_5 <= 2.50
|   |   |   |   |--- feature_0 <= 2.19
|   |   |   |   |   |--- feature_2 <= 3.33
|   |   |   |   |   |   |--- feature_4 <= 1208.00
|   |   |   |   |   |   |   |--- feature_6 <= 37.51
|   |   |   |   |   |   |   |   |--- feature_6 <= 34.01
|   |   |   |   |   |   |   |   |   |--- feature_7 <= -115.87
|   |   |   |   |   |   |   |   |   |   |--- feature_5 <= 1.51
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |   |--- feature_5 >  1.51
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 7
|   |   |   |   |   |   |   |   |   |--- feature_7 >  -115.87
|   |   |   |   |   |   |   |   |   |   |--- value: [0.25]
|   |   |   |   |   |   |   |   |--- feature_6 >  34.01
|   |   |   |   |   |   |   |   |
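The printed tree above labels the splits as `feature_0`, `feature_1`, and so on. A hedged sketch of a more readable printout follows: `export_text` accepts a `feature_names` list and a `max_depth` limit on the printed levels. The example assumes synthetic data and a small stand-in tree (`demo_tree`), and reuses the eight attribute names from the dataset description purely as labels.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Attribute names taken from the dataset description above.
feature_names = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                 'Population', 'AveOccup', 'Latitude', 'Longitude']

# Synthetic stand-in data (assumption), with one column per name.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# feature_names replaces the generic feature_i labels;
# max_depth limits how many levels of the tree are printed.
report = export_text(demo_tree, feature_names=feature_names, max_depth=2)
print(report)
```

In the notebook, the same call could be made with `dtr` and `list(ds.feature_names)` instead of the stand-in objects.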

In [None]:
#5) print out the decision path for the first test data sample X_test[0]. 5 points
print(X_test[0:1])


[[ 4.15180000e+00  2.20000000e+01  5.66307278e+00  1.07547170e+00
   1.55100000e+03  4.18059299e+00  3.25800000e+01 -1.17050000e+02]]
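The cell above prints the sample itself rather than its path through the tree. One possible way to print the decision path is sketched below with the `decision_path` and `apply` methods of the fitted estimator. It assumes synthetic data and a small stand-in tree (`demo_tree`) so that it runs on its own; in the notebook, `dtr` and `X_test[0:1]` would take their place.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data and tree (assumption), kept small for readability.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
demo_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

sample = X_demo[0:1]                               # keep the 2-D shape (1, n_features)
node_indicator = demo_tree.decision_path(sample)   # sparse matrix of shape (1, n_nodes)
leaf_id = demo_tree.apply(sample)[0]               # id of the leaf the sample ends in

# Node ids visited by this sample, from the root to the leaf.
node_ids = node_indicator.indices[node_indicator.indptr[0]:node_indicator.indptr[1]]

feature = demo_tree.tree_.feature
threshold = demo_tree.tree_.threshold

for node_id in node_ids:
    if node_id == leaf_id:
        leaf_value = demo_tree.tree_.value[node_id][0][0]
        print(f'leaf node {node_id}: predicted value {leaf_value:.3f}')
        continue
    value = sample[0, feature[node_id]]
    sign = '<=' if value <= threshold[node_id] else '>'
    print(f'node {node_id}: feature_{feature[node_id]} = {value:.3f} '
          f'{sign} {threshold[node_id]:.3f}')
```

Each printed line is one test the sample passes on its way down, and the last line is the leaf whose value becomes the prediction.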


##### 6) Question: Test different max_depth values (5, 8, 10, 20) and analyse the results. 15 points.
 - Set criterion = 'squared_error' (named 'mse' in scikit-learn versions before 1.0) and min_samples_leaf = 20
 - Compare the results with different max_depth values
 - Summarize the main disadvantages of decision trees
 
Response:

## Test Different max_depth values with min_samples_leaf =20
### [ NOTE: See code block below this response for the model experiments and their data. ]

## Results Comparison
As we observe in the results of each test, the MSE decreases as the maximum depth increases: the test MSE goes from 0.538 at depth 5 to 0.434 at depth 8 and 0.4 at depth 10. The R^2 increases at the same time (test R^2 of 0.596, 0.675, and 0.7), which indicates an improving goodness of fit. This behavior is governed by two parameters set for this experiment: max_depth and min_samples_leaf.

If we compare these results to the original tree from the previous part of the assignment, we notice that the original model has a training MSE of 0.0 but a test MSE of 0.544 and a test R^2 of only 0.592, which is a sign of overfitting. The restricted models do better on the test set because the number of levels in the tree is limited and every leaf node must contain at least 20 training samples.

## Disadvantages to using Decision Tree Models
By combining a regulated max depth (a low value) with a min_samples_leaf value that is set at 50 or greater, the decision tree may see some improvement in how well it generalizes to unseen data.

When the depth of a tree is not set, or is allowed to grow to a large value without a minimum number of samples per leaf, the tree can keep splitting until it memorizes the training data, which results in an overfit model. If min_samples_leaf is set too low (min_samples_leaf < 50) and the depth is set to a large value, this can also result in overfitting.

Finding the best balance between these parameters is important for the model to be effective and to report a goodness of fit that holds on new data.
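The overfitting effect described here can be made visible by tracking the gap between training and test R^2 as the depth limit is relaxed. The following is a minimal sketch on synthetic, noisy data (an assumption, chosen so the block runs on its own); the depth values are illustrative and `None` means no depth limit.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=1000, n_features=8, noise=20.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

results = []
for depth in (2, 5, 10, None):   # None lets the tree grow without a depth limit
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(Xd_train, yd_train)
    r2_train = model.score(Xd_train, yd_train)   # score() returns R^2 for regressors
    r2_test = model.score(Xd_test, yd_test)
    results.append((depth, model.get_depth(), r2_train, r2_test))
    print(f'max_depth={depth}: actual depth {model.get_depth()}, '
          f'R^2 train {r2_train:.3f}, R^2 test {r2_test:.3f}, '
          f'gap {r2_train - r2_test:.3f}')
```

A training score that keeps rising while the test score stalls or drops is the signature of a tree that is fitting noise.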

In [20]:
''' Max Depth 5 '''
dtr1 = DTR(criterion = "squared_error", max_depth = 5, min_samples_leaf = 20)
dtr1.fit(X_train, y_train)

#1 train dtr1
y_train_pred1 = dtr1.predict(X_train)
y_test_pred1 = dtr1.predict(X_test)

# print out dtr1 tree attributes: 
print('dtr1 Tree Attributes:')

print("dtr1 Tree Max Depth: ", dtr1.tree_.max_depth)

print("dtr1 Tree Number of Leaves: ", dtr1.tree_.n_leaves)

print("dtr1 Feature Importances:\n", dtr1.feature_importances_)


# Evaluate dtr1 model
print("\ndtr1 Tree Evaluation:")
regression_metrics(y_train, y_train_pred1, y_test, y_test_pred1)


''' Max Depth 8 '''
dtr2 = DTR(criterion = "squared_error", max_depth = 8, min_samples_leaf = 20)
dtr2.fit(X_train, y_train)

#1 train dtr2
y_train_pred2 = dtr2.predict(X_train)
y_test_pred2 = dtr2.predict(X_test)

# print out dtr2 tree attributes: 
print('\n\ndtr2 Tree Attributes:')

print("dtr2 Tree Max Depth: ", dtr2.tree_.max_depth)

print("dtr2 Tree Number of Leaves: ", dtr2.tree_.n_leaves)

print("dtr2 Feature Importances:\n", dtr2.feature_importances_)


# Evaluate dtr2 tree
print("\ndtr2 Tree Evaluation:")
regression_metrics(y_train, y_train_pred2, y_test, y_test_pred2)


''' Max Depth 10 '''
dtr3 = DTR(criterion = "squared_error", max_depth = 10, min_samples_leaf = 20)
dtr3.fit(X_train, y_train)

#1 train dtr3
y_train_pred3 = dtr3.predict(X_train)
y_test_pred3 = dtr3.predict(X_test)

# print out tree attributes: 
print('\n\ndtr3 Tree Attributes:')

print("dtr3Tree Max Depth: ", dtr3.tree_.max_depth)

print("dtr3 Tree Number of Leaves: ", dtr3.tree_.n_leaves)

print("dtr3 Feature Importances:\n", dtr3.feature_importances_)


# Evaluate Dtr3 Tree
print("\ndtr3 tree Evaluation:")
regression_metrics(y_train, y_train_pred3, y_test, y_test_pred3)

''' Max Depth 20 '''
dtr4 = DTR(criterion = "squared_error", max_depth = 20, min_samples_leaf = 20)
dtr4.fit(X_train, y_train)

#1 train dtr4
y_train_pred4 = dtr4.predict(X_train)
y_test_pred4 = dtr4.predict(X_test)

# print out dtr4 tree attributes: 
print('\n\ndtr4 Tree Attributes:')

print("dtr4 Tree Max Depth: ", dtr4.tree_.max_depth)

print("dtr4 Tree Number of Leaves: ", dtr4.tree_.n_leaves)

print("dtr4 Feature Importances:\n", dtr4.feature_importances_)


# Evaluate dtr4 tree
print("\ndtr4 tree Evaluation:")
regression_metrics(y_train, y_train_pred4, y_test, y_test_pred4)




dtr1 Tree Attributes:
dtr1 Tree Max Depth:  5
dtr1 Tree Number of Leaves:  31
dtr1 Feature Importances:
 [7.80466862e-01 4.04161418e-02 2.31469730e-02 4.93890473e-04
 0.00000000e+00 1.36643364e-01 1.83106263e-02 5.22142151e-04]

dtr1 Tree Evaluation:
MSE Train:  0.491
MSE Test:  0.538
R^2 Train:  0.631
R^2 Test:  0.596


dtr2 Tree Attributes:
dtr2 Tree Max Depth:  8
dtr2 Tree Number of Leaves:  173
dtr2 Feature Importances:
 [0.67906533 0.0377389  0.02977116 0.00520338 0.00269657 0.14335303
 0.04733993 0.05483171]

dtr2 Tree Evaluation:
MSE Train:  0.343
MSE Test:  0.434
R^2 Train:  0.743
R^2 Test:  0.675


dtr3 Tree Attributes:
dtr3Tree Max Depth:  10
dtr3 Tree Number of Leaves:  336
dtr3 Feature Importances:
 [0.64836821 0.03750381 0.03236455 0.00651091 0.00351732 0.14211693
 0.06336835 0.06624992]

dtr3 tree Evaluation:
MSE Train:  0.29
MSE Test:  0.4
R^2 Train:  0.782
R^2 Test:  0.7


dtr4 Tree Attributes:
dtr4 Tree Max Depth:  19
dtr4 Tree Number of Leaves:  555
dtr4 Feature Impor

#### 3. Random forests. 30 points
    1) What are the differences between bagging and random forests?
    2) Train a random forest model. Please tune the arguments, n_estimators, max_features, max_depth, to achieve good performance. 
    3) Can your random forest model achieve better performance than the decision tree? Please summarize the advantages of random forests.

##### 1) What is the main problem of the bagging approach, and how can random forests address the problem? 5 points

# Response:
## The Drawback of Using Bagging:
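 The main drawback of bagging is that the bagged trees tend to be highly correlated. Every tree is grown on a different bootstrap sample, but each split may choose among all features, so a strong predictor (such as MedInc in the feature importances above) dominates the top splits of most trees. Averaging such similar trees reduces the variance less than it could. Bagging also loses the interpretability of a single decision tree and, despite its higher accuracy, is computationally expensive because many trees must be trained.

## How Random Forests Help Address The Bagging Problem:
Random forests produce accurate predictions just like bagging, but at each split they only consider a random subset of the features (the max_features argument). This decorrelates the trees, so averaging them reduces the variance further. A random forest is still an ensemble of many trees, so it is not easier to interpret than bagging, but it can handle large data with the same or better accuracy.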
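The difference between the two approaches comes down to one argument in scikit-learn. As a hedged sketch: a `RandomForestRegressor` with `max_features=None` considers all features at every split, which corresponds to bagged trees, while a smaller `max_features` gives a random forest. The data is synthetic and the value `max_features=3` is an illustrative assumption; which model scores higher depends on the data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 features, 4 of them informative.
X_demo, y_demo = make_regression(n_samples=600, n_features=8, n_informative=4,
                                 noise=15.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# max_features=None: every split may use all 8 features (bagged trees).
bagged = RandomForestRegressor(n_estimators=50, max_features=None, random_state=0)
# max_features=3: every split sees a random subset of 3 features (random forest).
forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=0)

for name, model in (('bagging', bagged), ('random forest', forest)):
    model.fit(Xd_train, yd_train)
    print(f'{name}: features per split = {model.estimators_[0].max_features_}, '
          f'R^2 test = {model.score(Xd_test, yd_test):.3f}')
```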




In [157]:
# Random forests
from sklearn.ensemble import RandomForestRegressor as RFR

rf = RFR(n_estimators = 50, max_features=5, max_depth=20, min_samples_leaf=20, max_leaf_nodes=200, random_state=None)
# 2) train. 10 points
rf.fit(X_train, y_train)
y_train_predrf = rf.predict(X_train)
y_test_predrf = rf.predict(X_test)
print("Results of random forest model:")


#3) evaluate. 5 points. 10 extra points for R^2>0.8
regression_metrics(y_train, y_train_predrf, y_test, y_test_predrf)


Results of random forest model:
MSE Train:  0.246
MSE Test:  0.314
R^2 Train:  0.815
R^2 Test:  0.764


##### 3) Can your random forest model achieve better performance than the decision tree? Please summarize the advantages of random forests. 10 points

Response: 

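Yes. The random forest reaches a test MSE of 0.314 and a test R^2 of 0.764, compared with a test MSE of 0.4 and a test R^2 of 0.7 for the decision tree with max_depth = 10 above. One advantage a random forest has over a regular decision tree is the capability of handling large data sets. The other advantage is that it produces predictions at a higher accuracy than a standard decision tree: averaging many trees, each trained on a bootstrap sample (bagging) and restricted to a random subset of features at each split, reduces the variance of the prediction. Like a decision tree, a random forest can perform both regression and classification tasks. The trade-off is that the ensemble is harder to interpret than a single tree.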