# Supervised Learning with Scikit-Learn
This notebook provides a basis for training supervised learning models with scikit-learn. The main steps are:

1. Loading the data (e.g. from a CSV)
2. Pre-processing the data (making transformations to prepare data for machine learning)
3. Training/validating models using `sklearn` (see [this website](https://scikit-learn.org/stable/modules/classes.html) for great documentation on how to use this tool). We'll use cross-validation to train and tune the hyperparameters of our models.
4. Testing our models (make sure to leave a hold-out test dataset that we don't train models or tune their hyperparameters on).
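
The hold-out discipline described in step 4 can be sketched as follows. This is a minimal illustration on synthetic data from `make_regression`, not the dataset used in this notebook; the `Ridge` model, the sample sizes and the `*_demo` variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Set the test split aside first; it is not touched during model selection
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Cross-validate on the training split only
model = Ridge(alpha=1.0)
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="neg_mean_squared_error")
cv_rmse = float(np.sqrt(-cv_scores).mean())

# Fit once on the full training split, then score the hold-out set a single time
model.fit(X_tr, y_tr)
test_rmse = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

print(f"CV RMSE: {cv_rmse:.2f}  Test RMSE: {test_rmse:.2f}")
```

If the test RMSE is far worse than the cross-validated RMSE, that usually suggests the model selection overfit the training data.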

## Installation and Package Dependencies

In [None]:
# Installation of packages we need
!pip install scikit-learn matplotlib seaborn pandas kaggle lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
# Import packages we need
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

# Set visualization theme to make plots look nice
sns.set_theme()

In [None]:
# IMPORTANT - first create API token with Kaggle
#1. Head here: https://www.kaggle.com/<your username>/account.
#2. Next, click on account tab and then "create new API token".
#3. Make sure you can navigate to the JSON called "kaggle.json".

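# Upload the Kaggle API token (kaggle.json), then copy it to ~/.kaggle with restricted permissions
files.upload()
# os.path.exists("~/.kaggle") never expands "~", so create the directory via expanduser instead
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json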

# Download the dataset using the Kaggle API
!kaggle datasets download -d petermcnallysg/universal-design-space-building-energy-simulation

# Unzip the dataset
!unzip universal-design-space-building-energy-simulation.zip -d udsb

Saving kaggle.json to kaggle.json
Downloading universal-design-space-building-energy-simulation.zip to /content
 30% 5.00M/16.6M [00:00<00:00, 47.5MB/s]
100% 16.6M/16.6M [00:00<00:00, 114MB/s] 
Archive:  universal-design-space-building-energy-simulation.zip
  inflating: udsb/Design Space Input Parameters.xlsx  
  inflating: udsb/Universal_Design_Space_Building_Energy_Simulation_input_output.csv  


## 1. Import The Dataset
Here we'll import our dataset using the `pandas` library.
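
As a quick illustration of loading and inspecting a CSV with `pandas`, the sketch below reads a tiny in-memory table instead of a file on disk. The three column names are borrowed from this dataset, but the rows and values are made up for the example.

```python
import io

import pandas as pd

# A tiny inline CSV standing in for a file on disk (values are made up)
csv_text = """BuildingType,TotalArea,NumFloors
Office,12000,3
Retail,8000,1
Office,,5
"""

demo_df = pd.read_csv(io.StringIO(csv_text))

# Basic checks worth running on any freshly loaded dataset
print(demo_df.shape)         # (rows, columns)
print(demo_df.dtypes)        # which columns are numeric and which are text
print(demo_df.isna().sum())  # missing values per column
```

Checking dtypes and missing values early tends to make the later preprocessing decisions (encoding, imputation, scaling) easier to reason about.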

In [None]:
# Define path to the dataset
data_csv = os.path.join("udsb", "Universal_Design_Space_Building_Energy_Simulation_input_output.csv")

# Load the CSV as a pandas DataFrame
data_df = pd.read_csv(data_csv)



## 2. Preprocess the Data
During this step, we'll prepare the dataset for machine learning models: inputs need to be numeric, and features may need to be scaled or otherwise adjusted to promote model stability during training and hyperparameter tuning.

**This is arguably one of the most important steps in the machine learning process**.
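
One way to scale features without leaking information between cross-validation folds is to put the scaler inside a scikit-learn `Pipeline`. The sketch below uses synthetic data with two features on very different scales; the feature names, the `Ridge` model and all numbers are assumptions for illustration, not part of this notebook's workflow.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (assumption for illustration)
area = rng.uniform(1_000, 100_000, size=200)
ratio = rng.uniform(0.1, 0.9, size=200)
X_demo = np.column_stack([area, ratio])
y_demo = 0.001 * area + 50.0 * ratio + rng.normal(0.0, 1.0, size=200)

# The scaler lives inside the pipeline, so each CV fold fits it on its own training part only
scaled_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(scaled_model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
print("RMSE per fold:", np.sqrt(-scores))

# Standardized columns have zero mean and unit standard deviation on the data they were fit on
X_scaled = StandardScaler().fit_transform(X_demo)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Tree-based models such as decision trees and random forests are largely insensitive to feature scaling, so this step matters mostly for linear, distance-based and neural-network models.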

In [None]:
# Get all categorical variables at once
categorical_variables = [
    "BuildingType", "ClimateZone", "TotalArea_Setting", "FloorArea_Setting",
    "PlateDepth_Setting", "FloorHeight_Setting", "SolarDesign",
    "Standard", "HVAC", "HVAC_Setting", "EnvelopeQuality_Setting", "LPD_Adjustment_Setting"]
categorical_df = data_df[categorical_variables]
one_hot_variables = pd.get_dummies(categorical_df)
one_hot_keys = list(one_hot_variables.keys())

# Concatenate the dataframe to have these new variables
data_df = pd.concat([data_df, one_hot_variables], axis=1)
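
To see what `pd.get_dummies` does in the cell above, here is a small sketch on a toy frame. The column names and category labels are taken from this dataset, but the rows are made up; `dtype=int` is passed only so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (rows are made up)
toy_df = pd.DataFrame({
    "HVAC_Setting": ["Baseline", "Good", "Baseline"],
    "WWR": [0.3, 0.4, 0.5],
})

# Each category becomes its own 0/1 column named "<column>_<category>"
toy_one_hot = pd.get_dummies(toy_df[["HVAC_Setting"]], dtype=int)

# Dropping the original text column here is a choice made for the toy example
toy_encoded = pd.concat([toy_df.drop(columns=["HVAC_Setting"]), toy_one_hot], axis=1)
print(toy_encoded)
```

The notebook's own cell keeps the original text columns in `data_df`, which is fine as long as only the numeric and one-hot columns are selected as model inputs later.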

In [None]:
# Get the keys (column headers)
dataset_keys = data_df.keys()
print(dataset_keys)

# Set your X and Y variables here
x_var = ['TotalArea', 'FloorArea', 'NumFloors', 'PlateDepth',
       'PlateLength', 'FloorHeight',
       'Height', 'WWR','Wall_R_Value', 'Roof_R_Value',
       'Glass_and_Frame_U_Value', 'SHGC', 'LPD_Adjustment'] + one_hot_keys
y_var = ['Cooling_Electricity_kBTU_per_sf']  # Start for now - below are other y variables to consider

# Other y variables we might care about
#'Interior_Lights_Final_W_per_sf', 'Exterior_Lights_Final_1_W',
#       'Exterior_Lights_Final_2_W', 'Setpoint_Setting', 'HeatingCoil',
#       'COP_Efficiency_Heating', 'CoolingCoil', 'COP_Efficiency_Cooling',
#       'EUI_kBTU_per_sf', 'Electricity_Facility_kBTU_per_sf',
#       'NaturalGas_Facility_kBTU_per_sf', 'Cooling_Electricity_kBTU_per_sf',
#       'Heating_Electricity_kBTU_per_sf', 'Heating_NaturalGas_kBTU_per_sf',
#       'Heating_Total_kBTU_per_sf', 'WaterSystems_Electricity_kBTU_per_sf',
#       'Lighting_Electricity_kBTU_per_sf', 'Equipment_Electricity_kBTU_per_sf',
#       'Fans_Electricity_kBTU_per_sf', 'Pumps_Electricity_kBTU_per_sf',
#       'HeatRejection_Electricity_kBTU_per_sf',
#       'HeatRecovery_Electricity_kBTU_per_sf'

# Next, split dataset into X and Y
X = data_df[x_var].values  # Turns from DataFrame into NumPy array
Y = data_df[y_var].values  # Turns from DataFrame into NumPy array

# Next, shuffle the dataset (fixed seed so the subsample below is reproducible)
X, Y = shuffle(X, Y, random_state=42)
# Temporarily modify number of data points to take
n_samples = 10000
X, Y = X[:n_samples, :], Y[:n_samples, :]

# Next, split dataset into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Index(['ID', 'BuildingType', 'ClimateZone', 'TotalArea', 'TotalArea_Setting',
       'FloorArea', 'FloorArea_Setting', 'NumFloors', 'PlateDepth',
       'PlateDepth_Setting',
       ...
       'HVAC_Setting_Baseline', 'HVAC_Setting_Good', 'HVAC_Setting_Great',
       'HVAC_Setting_Ultra', 'EnvelopeQuality_Setting_Baseline',
       'EnvelopeQuality_Setting_HighPerformance',
       'EnvelopeQuality_Setting_UltraPerformance',
       'LPD_Adjustment_Setting_Base', 'LPD_Adjustment_Setting_Best',
       'LPD_Adjustment_Setting_Improved'],
      dtype='object', length=108)


## 3. Define our Machine Learning Models and Train/Tune Them
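
Before training, it may help to see what k-fold cross-validation does under the hood. The sketch below runs the fold loop by hand on synthetic data and checks that `cross_val_score` gives the same per-fold RMSE when handed the same splitter. The data and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for a training set (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

kfold = KFold(n_splits=5)
manual_rmse = []
for fit_idx, val_idx in kfold.split(X_demo):
    # Fit on four folds, validate on the remaining one
    fold_model = LinearRegression().fit(X_demo[fit_idx], y_demo[fit_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], fold_pred)))

# cross_val_score with the same splitter should reproduce the manual loop
auto_rmse = np.sqrt(-cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold,
                                     scoring="neg_mean_squared_error"))
print(np.allclose(manual_rmse, auto_rmse))
```

The `neg_mean_squared_error` scorer returns negated MSE because scikit-learn scorers follow a "higher is better" convention, which is why the sign is flipped before taking the square root.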

In [None]:
# Define the models you want to use
models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor()]

# Loop through each model and perform cross-validation
# (Y_train has shape (n, 1); ravel() passes the 1-D target that scikit-learn expects)
for model in models:
    scores = cross_val_score(model, X_train, Y_train.ravel(), cv=5, scoring='neg_mean_squared_error')
    print(model.__class__.__name__ + " CV Scores:")
    print(np.sqrt(-scores))
    print("Average RMSE:", np.mean(np.sqrt(-scores)))

LinearRegression CV Scores:
[15.20750592 14.16843119 14.76182391 17.25437681 15.69303223]
Average RMSE: 15.41703401280609
DecisionTreeRegressor CV Scores:
[3.4640456  3.82543938 3.08811653 3.60390278 3.3014069 ]
Average RMSE: 3.4565822358311116


RandomForestRegressor CV Scores:
[2.55081058 2.50861722 2.69518471 2.99998517 2.48502651]
Average RMSE: 2.6479248366141706
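
The loop above compares models with their default settings. Hyperparameter tuning with cross-validation, as mentioned in the introduction, could look like the sketch below. It uses `GridSearchCV` on synthetic data; the parameter grid and the data are hypothetical, and sensible ranges depend on the dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical grid: 3 x 3 = 9 candidate settings
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_tr, y_tr)  # only the training split is used for tuning

print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```

After the search, `search.best_estimator_` is refit on the whole training split by default and can then be scored once on the hold-out test set.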


In [None]:
from lazypredict.Supervised import LazyRegressor

# Fit a broad set of regressors with default settings and score each one on the test split.
# Note: because the scores come from the test split, treat this as a quick survey
# rather than a basis for final model selection.
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, Y_train, Y_test)

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003531 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 577
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 71
[LightGBM] [Info] Start training from score 17.602560





In [10]:
models

| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---:|---:|---:|---:|
| XGBRegressor | 1.0 | 1.0 | 1.97 | 0.36 |
| MLPRegressor | 1.0 | 1.0 | 2.07 | 10.65 |
| RandomForestRegressor | 0.99 | 0.99 | 2.49 | 7.14 |
| HistGradientBoostingRegressor | 0.99 | 0.99 | 2.54 | 0.83 |
| LGBMRegressor | 0.99 | 0.99 | 2.55 | 0.3 |
| ExtraTreesRegressor | 0.99 | 0.99 | 2.62 | 8.27 |
| BaggingRegressor | 0.99 | 0.99 | 2.65 | 0.73 |
| ExtraTreeRegressor | 0.99 | 0.99 | 2.99 | 0.16 |
| DecisionTreeRegressor | 0.99 | 0.99 | 3.31 | 0.17 |
| PoissonRegressor | 0.99 | 0.99 | 3.53 | 0.3 |
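
The table reports both R-squared and adjusted R-squared. The standard textbook adjustment is `1 - (1 - R2) * (n - 1) / (n - p - 1)`, where `n` is the number of scored rows and `p` the number of features. The sketch below implements that formula; it is assumed, not confirmed here, to match what the library reports, and the function name and example numbers are hypothetical.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared for a model with n_features predictors scored on n_samples rows."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical numbers, not taken from the table above
print(adjusted_r2(0.90, n_samples=101, n_features=10))
```

When there are many more rows than features, the penalty is small, which would explain why the two columns can look identical after rounding.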
