<a href="https://colab.research.google.com/github/SinothileB/Prediction-of-Product-Sales/blob/main/Ensemble_Trees_(Core).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name: Sinothile Blose

You will use the Boston Housing data that you used in the Regression Trees practice assignment. Let's see if we can improve the model's results using ensemble tree models! The target is PRICE. This data set is all numeric and clean. Tree models do not require scaled data, so no preprocessing steps are needed. Be sure to split the data before modeling.

Your task is to create the best possible model to predict house prices.

Imports

In [18]:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
# Set pandas as the default output for sklearn
from sklearn import set_config
set_config(transform_output='pandas')


In [2]:
# import linear regression model
from sklearn.linear_model import LinearRegression
# import regression metrics needed from sklearn
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

Custom Functions

In [3]:
def regression_metrics(y_true, y_pred, label='', verbose=True, output_dict=False):
  # Get metrics
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  # Take the square root directly: the squared=False argument of
  # mean_squared_error is deprecated as of scikit-learn 1.4
  rmse = mse ** 0.5
  r_squared = r2_score(y_true, y_pred)
  if verbose:
    # Print Result with Label and Header
    header = "-"*60
    print(header, f"Regression Metrics: {label}", header, sep='\n')
    print(f"- MAE = {mae:,.3f}")
    print(f"- MSE = {mse:,.3f}")
    print(f"- RMSE = {rmse:,.3f}")
    print(f"- R^2 = {r_squared:,.3f}")
  if output_dict:
    # Return the metrics as a dictionary so they can be collected in a dataframe
    metrics = {'Label':label, 'MAE':mae,
               'MSE':mse, 'RMSE':rmse, 'R^2':r_squared}
    return metrics

In [4]:
def evaluate_regression(reg, X_train, y_train, X_test, y_test, verbose = True,
                        output_frame=False):
  # Get predictions for training data
  y_train_pred = reg.predict(X_train)

  # Call the helper function to obtain regression metrics for training data
  results_train = regression_metrics(y_train, y_train_pred, verbose = verbose,
                                     output_dict=output_frame,
                                     label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = reg.predict(X_test)
  # Call the helper function to obtain regression metrics for test data
  results_test = regression_metrics(y_test, y_test_pred, verbose = verbose,
                                    output_dict=output_frame,
                                    label='Test Data')

  # Store results in a dataframe if output_frame is True
  if output_frame:
    results_df = pd.DataFrame([results_train,results_test])
    # Set the label as the index
    results_df = results_df.set_index('Label')
    # Set index.name to none to get a cleaner looking result
    results_df.index.name=None
    # Return the dataframe
    return results_df.round(3)
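
As a quick sanity check on what these helpers report, the sketch below recomputes MAE, MSE, RMSE, and R^2 by hand with NumPy. The `y_true` and `y_pred` values are made-up toy numbers chosen for easy arithmetic; they are not taken from the housing data.

```python
import numpy as np

# Hypothetical toy values, chosen only to make the arithmetic easy to follow
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_pred - y_true                         # [-1, 0, 1, 2]
mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
rmse = np.sqrt(mse)                              # back in the units of the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r_squared:.3f}")
# MAE=1.000  MSE=1.500  RMSE=1.225  R^2=0.700
```

RMSE is always at least as large as MAE because squaring weights the largest errors more heavily, which is why both are reported.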

Load Data

In [6]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
fpath = "/content/drive/MyDrive/CodingDojo/02-MachineLearning/Week06/Data/Boston_Housing_from_Sklearn - Boston_Housing_from_Sklearn.csv"
df = pd.read_csv(fpath)
df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   NOX      506 non-null    float64
 2   RM       506 non-null    float64
 3   AGE      506 non-null    float64
 4   PTRATIO  506 non-null    float64
 5   LSTAT    506 non-null    float64
 6   PRICE    506 non-null    float64
dtypes: float64(7)
memory usage: 27.8 KB


,CRIM,NOX,RM,AGE,PTRATIO,LSTAT,PRICE
0,0.00632,0.538,6.575,65.2,15.3,4.98,24.0
1,0.02731,0.469,6.421,78.9,17.8,9.14,21.6
2,0.02729,0.469,7.185,61.1,17.8,4.03,34.7
3,0.03237,0.458,6.998,45.8,18.7,2.94,33.4
4,0.06905,0.458,7.147,54.2,18.7,5.33,36.2


Train test split

In [9]:
X = df.drop(columns='PRICE')
y = df['PRICE']

In [10]:
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state = 42)
X_train.head()

,CRIM,NOX,RM,AGE,PTRATIO,LSTAT
182,0.09103,0.488,7.155,92.2,17.8,4.82
155,3.53501,0.871,6.152,82.6,14.7,15.02
280,0.03578,0.4429,7.82,64.5,14.9,3.76
126,0.38735,0.581,5.613,95.6,19.1,27.26
329,0.06724,0.46,6.333,17.2,16.9,7.34
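
Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. A minimal sketch of what that means for a 506-row data set follows; the zero-filled `X_demo` and `y_demo` arrays are placeholders that only mimic the shape of the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows (506) and features (6)
# as the housing data; the values themselves are irrelevant here
X_demo = np.zeros((506, 6))
y_demo = np.zeros(506)

# No test_size given, so the default 25% hold-out is used
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print("Train rows:", len(X_tr))
print("Test rows:", len(X_te))
```

With only about a quarter of 506 rows in the test set, test metrics can shift noticeably with a different `random_state`, so small differences between models should be read with caution.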


1. Train and evaluate a default Bagged Trees

In [15]:
## Instantiate a Default Model
from sklearn.ensemble import BaggingRegressor
bagreg = BaggingRegressor(random_state = 42)
# Fit the model (no pipeline since there was no preprocessing)
bagreg.fit(X_train, y_train)
# Call custom function for evaluation
evaluate_regression(bagreg,X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 1.103
- MSE = 3.487
- RMSE = 1.867
- R^2 = 0.961

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 2.316
- MSE = 12.575
- RMSE = 3.546
- R^2 = 0.820


2. Use GridSearchCV to tune the Bagged Tree model to optimize performance on the test set.

In [16]:
# Obtain list of parameters
bagreg.get_params()

{'base_estimator': 'deprecated',
 'bootstrap': True,
 'bootstrap_features': False,
 'estimator': None,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [25]:
# Define parameters to tune
param_grid = {'n_estimators': [5, 10, 20, 30, 40, 50, 60, 70],
              'max_samples': [0.5, 0.7, 0.9, 0.10, 0.11, 0.12],
              'max_features': [0.5, 0.7, 0.9, 0.10, 0.11, 0.12]}
# Instantiate the gridsearch
gridsearch = GridSearchCV(bagreg, param_grid, n_jobs=-1, verbose=1)
# Fit the gridsearch on the training data
gridsearch.fit(X_train, y_train)
# Obtain the best parameters from the gridsearch
gridsearch.best_params_

Fitting 5 folds for each of 288 candidates, totalling 1440 fits


{'max_features': 0.9, 'max_samples': 0.7, 'n_estimators': 50}
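
`best_params_` only names the winning combination. To see how close the other candidates were, the cross-validation results can be inspected directly. The sketch below shows one way to do that on synthetic data from `make_regression` (an assumption for illustration, not the housing data) with a deliberately tiny grid; `demo_grid`, `demo_search`, and `cv_table` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=10.0, random_state=42)

# Deliberately tiny grid so the example runs quickly: 4 candidates x 3 folds
demo_grid = {'n_estimators': [5, 10], 'max_samples': [0.5, 1.0]}
demo_search = GridSearchCV(BaggingRegressor(random_state=42), demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# One row per candidate; mean_test_score is the mean R^2 over the validation folds
cv_table = pd.DataFrame(demo_search.cv_results_)
cv_table = cv_table[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(cv_table.sort_values('rank_test_score'))

# best_score_ is the mean validation-fold score of the winning candidate
print("Best cross-validated R^2:", round(demo_search.best_score_, 3))
```

Note that the "test" scores in `cv_results_` come from validation folds inside the training data. The held-out test set is never seen by the grid search, so the best cross-validated candidate is not guaranteed to be the best on the test set.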

 - Define and evaluate the model with the best parameters

In [23]:
best_bagreg_grid = gridsearch.best_estimator_
# Evaluate the tuned model
evaluate_regression(best_bagreg_grid, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 1.303
- MSE = 4.025
- RMSE = 2.006
- R^2 = 0.955

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 2.278
- MSE = 13.432
- RMSE = 3.665
- R^2 = 0.808


3. Train and evaluate a default random forest

In [26]:
from sklearn.ensemble import RandomForestRegressor



In [27]:
## Instantiate a Default Model
rf = RandomForestRegressor(random_state = 42)
# Fit the model (no pipeline since there was no preprocessing)
rf.fit(X_train, y_train)
# Call custom function for evaluation
evaluate_regression(rf,X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 0.954
- MSE = 2.028
- RMSE = 1.424
- R^2 = 0.977

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 2.208
- MSE = 11.635
- RMSE = 3.411
- R^2 = 0.834


4. Use GridSearchCV to tune the Random Forest model to optimize performance on the test set.

In [28]:
# Parameters for tuning
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [38]:
# Define param grid with options to try
params = {'max_depth': [None,25,30,35,40],
          'n_estimators':[200,250,300,350],
          'min_samples_leaf':[5,6,7],
          'max_features':['sqrt','log2',None],
          'oob_score':[True,False],
          }

In [40]:
# Instantiate the gridsearch
gridsearch = GridSearchCV(rf, params, n_jobs=-1, cv = 3, verbose=1)
# Fit the gridsearch on training data
gridsearch.fit(X_train, y_train)

Fitting 3 folds for each of 360 candidates, totalling 1080 fits


In [41]:
# Obtain best parameters
gridsearch.best_params_

{'max_depth': None,
 'max_features': None,
 'min_samples_leaf': 5,
 'n_estimators': 350,
 'oob_score': True}
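
A note on `oob_score`: as far as I understand it, this flag only controls whether an out-of-bag estimate is computed after fitting; it does not change the trees themselves. If that holds, searching over it doubles the number of fits (360 candidates instead of 180) without changing any predictions. The sketch below checks this on synthetic data (an assumption, not the housing data); the `rf_with_oob` and `rf_without_oob` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=10.0, random_state=42)

# Two forests that differ only in the oob_score flag
rf_with_oob = RandomForestRegressor(n_estimators=50, oob_score=True,
                                    random_state=42).fit(X_demo, y_demo)
rf_without_oob = RandomForestRegressor(n_estimators=50, oob_score=False,
                                       random_state=42).fit(X_demo, y_demo)

# Compare the predictions of the two fitted forests
same_predictions = np.allclose(rf_with_oob.predict(X_demo),
                               rf_without_oob.predict(X_demo))
print("Identical predictions:", same_predictions)

# The extra attribute is the only difference: an R^2 estimate from out-of-bag rows
print("OOB R^2 estimate:", round(rf_with_oob.oob_score_, 3))
```

Dropping `oob_score` from the grid and setting it once on the estimator would halve the search time while still providing the out-of-bag estimate.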

 - Evaluate the best Random Forest model's performance.

In [42]:
# Retrieve the best model (GridSearchCV has already refit it on the full training data)
best_rf = gridsearch.best_estimator_
evaluate_regression(best_rf, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 1.781
- MSE = 8.259
- RMSE = 2.874
- R^2 = 0.907

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 2.214
- MSE = 13.950
- RMSE = 3.735
- R^2 = 0.801
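
One possible reason the tuned model scores below the default one: the grid searched `min_samples_leaf` over 5, 6, and 7, so the default value of 1 was never a candidate. A minimal sketch of the alternative, assuming synthetic data from `make_regression` and a small hypothetical grid (`demo_params`), is to include the defaults in every list so the search can fall back to them.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=10.0, random_state=42)

# Each list contains the library default (min_samples_leaf=1, max_features=1.0)
demo_params = {'min_samples_leaf': [1, 5], 'max_features': [1.0, 'sqrt']}
demo_rf = RandomForestRegressor(n_estimators=20, random_state=42)
demo_search = GridSearchCV(demo_rf, demo_params, cv=3)
demo_search.fit(X_demo, y_demo)

# Look up the cross-validated score of the default combination
default_combo = {'min_samples_leaf': 1, 'max_features': 1.0}
default_index = list(demo_search.cv_results_['params']).index(default_combo)
default_cv_score = demo_search.cv_results_['mean_test_score'][default_index]

print("Best params:", demo_search.best_params_)
print("Default CV R^2:", round(default_cv_score, 3))
print("Best CV R^2:", round(demo_search.best_score_, 3))
```

Because the default combination is itself a candidate, the best cross-validated score can never fall below the default's cross-validated score. This is a guarantee about the validation folds only; the held-out test score can still move either way.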


5. Which model and model parameters provided the best results?

The default Random Forest model provided the best results.

It reached an R^2 of 0.834 on the test data with the default parameters listed below. Tuning did not improve on this: the tuned Random Forest scored an R^2 of 0.801 on the test data.

{'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0}

For the Bagged Trees model, the best test result also came from the default parameters: a test R^2 of 0.820, about 0.01 lower than the default Random Forest. The tuned Bagged Trees model scored 0.808. The default Bagged Trees parameters were:

{'base_estimator': 'deprecated',
 'bootstrap': True,
 'bootstrap_features': False,
 'estimator': None,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
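
To make the comparison easier to scan, the test-set metrics printed in the evaluation cells above can be gathered into one table. This is a small sketch with the values typed in by hand from those outputs; `test_results`, `ranked`, and `best_model` are hypothetical names.

```python
import pandas as pd

# Test-set metrics copied from the evaluation outputs in this notebook
test_results = pd.DataFrame(
    [
        {'Model': 'Bagged Trees (default)', 'MAE': 2.316, 'RMSE': 3.546, 'R^2': 0.820},
        {'Model': 'Bagged Trees (tuned)', 'MAE': 2.278, 'RMSE': 3.665, 'R^2': 0.808},
        {'Model': 'Random Forest (default)', 'MAE': 2.208, 'RMSE': 3.411, 'R^2': 0.834},
        {'Model': 'Random Forest (tuned)', 'MAE': 2.214, 'RMSE': 3.735, 'R^2': 0.801},
    ]
).set_index('Model')

# Rank the four models by test R^2, best first
ranked = test_results.sort_values('R^2', ascending=False)
print(ranked)

best_model = ranked.index[0]
print("Best by test R^2:", best_model)
```

In practice the same table could be built from `evaluate_regression(..., output_frame=True)` rather than typed by hand, which avoids copy errors.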

6. Explain in a text cell how your model will perform if deployed by referring to the metrics. Ex. How close can your stakeholders expect its predictions to be to the true value?

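With the default parameters:

Bagged Trees model - there is some overfitting (R^2 of 0.961 on the training data versus 0.820 on the test data), but the model still generalizes reasonably well.

- The R^2 on the test data is 0.820, so the model explains about 82% of the variance in price.
- Tuning did not improve the test score (R^2 of 0.808 after tuning).
- PRICE is recorded in thousands of dollars, so the test MAE of 2.316 means predictions are off by about $2,300 on average, while the test RMSE of 3.546 corresponds to about $3,500.

Random Forest model - there is also some overfitting (R^2 of 0.977 on the training data versus 0.834 on the test data), but again it is not severe.

- The R^2 on the test data is 0.834, so the model explains about 83% of the variance in price.
- Tuning did not improve the test score (R^2 of 0.801 after tuning).
- The test MAE of 2.208 means predictions are off by about $2,200 on average, while the test RMSE of 3.411 corresponds to about $3,400. Because RMSE weights large errors more heavily, stakeholders should expect occasional misses larger than the typical MAE.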
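
The dollar figures quoted for stakeholders come from a simple unit conversion. The sketch below spells it out, assuming PRICE is recorded in thousands of dollars; `PRICE_UNIT_DOLLARS` and `test_errors` are hypothetical names, and the metric values are copied from the test outputs of the two default models.

```python
# Assumption: PRICE is recorded in thousands of dollars
PRICE_UNIT_DOLLARS = 1_000

# Test-set errors of the two default models, copied from this notebook's outputs
test_errors = {
    'Bagged Trees (default)': {'MAE': 2.316, 'RMSE': 3.546},
    'Random Forest (default)': {'MAE': 2.208, 'RMSE': 3.411},
}

# Convert each error from target units to whole dollars
errors_in_dollars = {
    model: {name: round(value * PRICE_UNIT_DOLLARS) for name, value in metrics.items()}
    for model, metrics in test_errors.items()
}

for model, metrics in errors_in_dollars.items():
    print(f"{model}: MAE = ${metrics['MAE']:,}, RMSE = ${metrics['RMSE']:,}")
# Bagged Trees (default): MAE = $2,316, RMSE = $3,546
# Random Forest (default): MAE = $2,208, RMSE = $3,411
```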
