# Import Library

This section imports the libraries and modules used to build, train, and evaluate our machine learning models. Think of them as tools in a toolbox, each making a specific task easier. Here’s a brief explanation of each import statement:

```python
import pandas as pd
import numpy as np
```

- **pandas (`pd`)**: This library is used for data manipulation and analysis. It allows us to easily read, process, and analyze data stored in various formats, such as CSV files.
- **numpy (`np`)**: This library is used for numerical computing. It provides support for arrays (lists of numbers) and mathematical functions to operate on these arrays.

```python
from sklearn... import ...
```

- **sklearn**: This library is used for machine learning tasks such as classification, regression, clustering, and more. It provides a wide range of tools and algorithms to build and evaluate machine learning models.

```python
from xgboost import XGBRegressor
```

- **XGBRegressor**: This is the regression model from the XGBoost library, an optimized implementation of gradient boosting. It is known for its high performance and efficiency, especially on larger datasets.

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

# Reading Dataset

In this part of the code, we read the dataset that is later split into training and testing data. The dataset is stored in a CSV (Comma-Separated Values) file, a common format for tabular data. The `pd.read_csv()` function from the `pandas` library reads it into a DataFrame, a two-dimensional data structure that resembles a table. We then call `dropna(how='all')` to remove rows in which every value is missing; rows with only some missing values are kept.

In [2]:
data = pd.read_csv("Template_for_python S04.csv")
data = data.dropna(how='all')
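
To make the effect of `dropna(how='all')` concrete, here is a minimal sketch on a tiny in-memory CSV. The CSV text and its values are made up for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV standing in for the real file. The second data row is
# completely empty; the third row is only missing its res_eui value.
csv_text = "clim_zone,tot_floor_area,res_eui\n4A,1200,85.5\n,,\n7,950,\n"

raw = pd.read_csv(io.StringIO(csv_text))

# how='all' drops a row only when every column in it is missing.
cleaned = raw.dropna(how="all")

print(raw.shape)      # rows and columns before cleaning
print(cleaned.shape)  # the fully empty row is gone
print(cleaned)
```

The partially filled row survives, so any remaining missing values still have to be handled before training.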

# Define Features and Target

In this part of the code, we are defining the features (input variables) and the target (output variable) that will be used to train our machine learning model. The features are the columns in the dataset that will be used to make predictions, while the target is the column that contains the values we want to predict.

In [3]:
features = ['clim_zone', 'build_orient', 'build_type',
            'tot_floor_area', 'fac_glaz_uval', 'fac_glaz_shgc', 'wall_type', 'wall_uval',
            'floor_finish', 'floor_uval', 'roof_type', 'roop_uval', 'infilt_type', 'infilt_ach',
            'epd', 'lpd', 'hvac_type', 'pv_pe', 'pv_po', 'pv_pt', 'pv_pa',
            'res_hedc', 'res_cedc']
target = 'res_eui'
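
Before using the lists, it can help to confirm that every listed column really exists in the DataFrame, since a misspelled name only fails later with a `KeyError`. The sketch below uses a made-up two-row DataFrame and shortened lists (`toy`, `toy_features`, `toy_target` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Hypothetical miniature dataset with a subset of the real column names.
toy = pd.DataFrame(
    {
        "clim_zone": ["4A", "7"],
        "tot_floor_area": [1200.0, 950.0],
        "res_eui": [85.5, 120.3],
    }
)

toy_features = ["clim_zone", "tot_floor_area"]
toy_target = "res_eui"

# Collect any requested column that is absent from the DataFrame.
missing = [col for col in toy_features + [toy_target] if col not in toy.columns]
print("Missing columns:", missing)

# X holds the input columns, y the single column to predict.
X = toy[toy_features]
y = toy[toy_target]
```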

# Preprocessing Non-Numerical Features

Not all features in the dataset are numerical. Some features may be categorical or text-based, which need to be converted into numerical values before they can be used in a machine learning model. In this part of the code, we are preprocessing non-numerical features by converting them into numerical values using techniques such as one-hot encoding.

## About One-Hot Encoding

One-hot encoding is a technique used to convert categorical variables into a numerical format that can be used in machine learning models. It works by creating a binary column for each category in the categorical variable, where each column indicates the presence or absence of that category. This allows the model to learn the relationship between the categories and the target variable.

### One-Hot Encoding Explained

For example, a column of colors (red, green, blue) becomes three binary (0 or 1) columns, one per color. Each row has a 1 only in the column that matches its original value and a 0 in the others.

#### Example: One-Hot Encoding

Let's say we have a dataset of building types with a column called `build_type`. This column has three categories: "Residential", "Commercial", and "Industrial".

Here’s what the original data might look like:

| build_type  |
|-------------|
| Residential |
| Commercial  |
| Industrial  |
| Residential |
| Commercial  |

Using one-hot encoding, we will convert this single column into three new columns, one for each category. Each column will contain binary values indicating whether the original `build_type` was that category.

| build_type_Residential | build_type_Commercial | build_type_Industrial |
|------------------------|-----------------------|-----------------------|
| 1                      | 0                     | 0                     |
| 0                      | 1                     | 0                     |
| 0                      | 0                     | 1                     |
| 1                      | 0                     | 0                     |
| 0                      | 1                     | 0                     |

In [4]:
categorical_features = ['clim_zone', 'build_type', 'wall_type', 'floor_finish', 'roof_type', 'infilt_type', 'hvac_type']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ], remainder='passthrough')

# Splitting Data Into Training and Testing

Why do we split the data into training and testing sets?

When building a machine learning model, it is important to evaluate its performance on data that it has not seen before. This helps us understand how well the model generalizes to new, unseen data. To achieve this, we split the dataset into two parts: a training set and a testing set. Here, `test_size=0.2` holds out 20% of the rows for testing, and `random_state=2024` makes the split reproducible.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(data[features], data[target], test_size=0.2, random_state=2024)
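
As a small illustration of these two arguments, the sketch below splits ten made-up rows. The arrays are placeholders, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each, and one target per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2024)
print(len(X_tr), len(X_te))  # 8 training rows, 2 test rows

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=2024)
print(np.array_equal(X_te, X_te2))
```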

# Define Model

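In this experiment, we train five models side by side: LinearRegression, XGBoost, RandomForest, GradientBoosting, and Support Vector Machine (SVR). Each model is wrapped in a pipeline that applies the same preprocessing step first. All five make predictions on the test set, and the one with the best evaluation metrics is considered the best model.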

In [6]:
model_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

model_xgb = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=2024))
])

model_rfr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=2024))
])

model_gbr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(n_estimators=100, random_state=2024))
])

model_svr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', SVR(kernel='rbf', C=1.0, epsilon=0.1))
])
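
The five pipelines above all reference the same `preprocessor` object, which is refit each time a pipeline is fit. That is harmless here because every pipeline is fit on the same training data, but a variation is to give each pipeline its own copy. The sketch below is one possible way to do that; `build_pipeline`, `base_preprocessor`, and the single-column setup are illustrative names and assumptions, not part of the notebook.

```python
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_pipeline(regressor, preprocessor):
    """Wrap a regressor in a pipeline with its own unfitted copy of the preprocessor."""
    return Pipeline(
        steps=[
            ("preprocessor", clone(preprocessor)),
            ("regressor", regressor),
        ]
    )


# Hypothetical preprocessor with a single categorical column.
base_preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["hvac_type"])],
    remainder="passthrough",
)

# A dictionary keeps the models together so they can be looped over later.
models = {
    "lr": build_pipeline(LinearRegression(), base_preprocessor),
    "rfr": build_pipeline(RandomForestRegressor(n_estimators=100, random_state=2024), base_preprocessor),
    "gbr": build_pipeline(GradientBoostingRegressor(n_estimators=100, random_state=2024), base_preprocessor),
}

for name, pipeline in models.items():
    print(name, type(pipeline.named_steps["regressor"]).__name__)
```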

# Training

In this part of the code, we train each machine learning model on the training data. A model learns the relationship between the input features and the target variable by adjusting its internal parameters, with the goal of minimizing the error between its predictions and the actual values in the training set. Right after fitting, each model also predicts the target for the test set, and these predictions are used in the evaluation step.

In [7]:
model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)

model_xgb.fit(X_train, y_train)
y_pred_xgb = model_xgb.predict(X_test)

model_rfr.fit(X_train, y_train)
y_pred_rfr = model_rfr.predict(X_test)

model_gbr.fit(X_train, y_train)
y_pred_gbr = model_gbr.predict(X_test)

model_svr.fit(X_train, y_train)
y_pred_svr = model_svr.predict(X_test)
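
The same fit-then-predict pattern can be written as a loop. The sketch below uses synthetic data and two plain regressors (no preprocessing) only to show the pattern; the data-generating formula and the names `candidates` and `predictions` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends linearly on two of three features, plus noise.
rng = np.random.default_rng(2024)
X = rng.uniform(0, 10, size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2024)

candidates = {
    "lr": LinearRegression(),
    "gbr": GradientBoostingRegressor(n_estimators=100, random_state=2024),
}

predictions = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)                    # learn from the training rows only
    predictions[name] = model.predict(X_te)  # predict the held-out rows

for name, y_pred in predictions.items():
    print(name, y_pred[:3])
```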

# Evaluate Model

We use three metrics to evaluate the performance of each model:
* R2 Score: The proportion of the variance in the target variable that is explained by the model. The best possible value is 1 (a perfect fit), and 0 means the model does no better than always predicting the mean of the target. The score can also be negative when the model does worse than that baseline.
* RMSE (Root Mean Squared Error): The square root of the average of the squared differences between the predicted and actual values. It is expressed in the same units as the target and penalizes large errors more heavily.
* MAE (Mean Absolute Error): The average of the absolute differences between the predicted and actual values. It is also in the units of the target and is less sensitive to a few large errors than RMSE.
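
The three metrics can be checked on a small worked example. The numbers below are made up; the second set of predictions is deliberately poor to show that R2 can fall below 0. `regression_metrics` is an illustrative helper, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_good = np.array([12.0, 18.0, 33.0, 39.0])  # close to the true values
y_bad = np.array([40.0, 10.0, 50.0, 5.0])    # worse than predicting the mean (25)


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, and MAE for one set of predictions."""
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    return r2, rmse, mae


good = regression_metrics(y_true, y_good)
bad = regression_metrics(y_true, y_bad)

# Good predictions: squared errors 4, 4, 9, 1 sum to 18; total variation is 500,
# so R2 = 1 - 18/500 = 0.964, RMSE = sqrt(18/4), and MAE = (2+2+3+1)/4 = 2.
print(good)

# Bad predictions: squared errors sum to 2625, so R2 = 1 - 2625/500 = -4.25.
print(bad)
```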


In [8]:
r2_lr = r2_score(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
mae_lr = np.mean(np.abs(y_test - y_pred_lr))
r2_xgb = r2_score(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
mae_xgb = np.mean(np.abs(y_test - y_pred_xgb))
r2_rfr = r2_score(y_test, y_pred_rfr)
rmse_rfr = np.sqrt(mean_squared_error(y_test, y_pred_rfr))
mae_rfr = np.mean(np.abs(y_test - y_pred_rfr))
r2_gbr = r2_score(y_test, y_pred_gbr)
rmse_gbr = np.sqrt(mean_squared_error(y_test, y_pred_gbr))
mae_gbr = np.mean(np.abs(y_test - y_pred_gbr))
r2_svr = r2_score(y_test, y_pred_svr)
rmse_svr = np.sqrt(mean_squared_error(y_test, y_pred_svr))
mae_svr = np.mean(np.abs(y_test - y_pred_svr))

print(f'R-squared LR: {r2_lr}')
print(f'RMSE LR: {rmse_lr}')
print(f'MAE LR: {mae_lr}\n')
print(f'R-squared XGB: {r2_xgb}')
print(f'RMSE XGB: {rmse_xgb}')
print(f'MAE XGB: {mae_xgb}\n')
print(f'R-squared RFR: {r2_rfr}')
print(f'RMSE RFR: {rmse_rfr}')
print(f'MAE RFR: {mae_rfr}\n')
print(f'R-squared GBR: {r2_gbr}')
print(f'RMSE GBR: {rmse_gbr}')
print(f'MAE GBR: {mae_gbr}\n')
print(f'R-squared SVR: {r2_svr}')
print(f'RMSE SVR: {rmse_svr}')
print(f'MAE SVR: {mae_svr}\n')

R-squared LR: -1.2379412783092114
RMSE LR: 42.23415453242447
MAE LR: 8.107548112454621

R-squared XGB: 0.975441954274327
RMSE XGB: 4.424216418067638
MAE XGB: 3.0005763924640156

R-squared RFR: 0.9543418718411509
RMSE RFR: 6.032516419192416
MAE RFR: 3.7163043478260867

R-squared GBR: 0.9905612788203739
RMSE GBR: 2.7428119934884907
MAE GBR: 1.8863877913816305

R-squared SVR: 0.07855218632395455
RMSE SVR: 27.10034584366519
MAE SVR: 16.35101037702196



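Based on these evaluation metrics, we can judge how well each model performs and decide what to adjust. Of the five models, GradientBoostingRegressor performed the best, with the highest R2 score (about 0.991) and the lowest RMSE (about 2.74) and MAE (about 1.89). XGBoost and RandomForest follow closely. SVR explains little of the variance (R2 of about 0.08), and LinearRegression has a negative R2 score, which means that on the test set it did worse than always predicting the mean of the target.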
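
The metrics printed above can be gathered into a small table and ranked, which is less error-prone than comparing the printout by eye. The values below are copied from that output and rounded to four decimal places; the names `results` and `ranked` are illustrative.

```python
import pandas as pd

# Metrics from the printed output above, rounded to four decimals.
results = pd.DataFrame(
    {
        "r2": [-1.2379, 0.9754, 0.9543, 0.9906, 0.0786],
        "rmse": [42.2342, 4.4242, 6.0325, 2.7428, 27.1003],
        "mae": [8.1075, 3.0006, 3.7163, 1.8864, 16.3510],
    },
    index=["LR", "XGB", "RFR", "GBR", "SVR"],
)

# Lower RMSE is better, so the first row after sorting is the best model.
ranked = results.sort_values("rmse")
best = ranked.index[0]

print(ranked)
print("Best model by RMSE:", best)
```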

# Check Feature Importance
(For Linear Regression Model)

We want to check which of the features are most important in predicting the target variable. This helps us understand which features have the most impact on the target and how they contribute to the model's predictions.

A positive coefficient means the predicted target increases as the feature increases. A negative coefficient means the predicted target decreases as the feature increases (an inverse relationship).

We sort the features by the absolute value of their coefficients to identify the most influential ones. Because the features are not scaled, the size of a coefficient also depends on the units of its feature, so this ranking is only a rough guide.

Categorical features have the prefix `cat__`, and the category is appended to the end of the feature name. For example, `cat__hvac_type_0.0` is the feature `hvac_type` with category `0.0`, which in this case is `Fan Coil Units and Central Plant`. Numerical features that were passed through unchanged have the prefix `remainder__`.

In [12]:
coefficients = model_lr.named_steps['regressor'].coef_
feature_names = model_lr.named_steps['preprocessor'].get_feature_names_out()
feature_importance = dict(zip(feature_names, coefficients))

sorted_feature_importance = dict(sorted(feature_importance.items(), key=lambda item: abs(item[1]), reverse=True))

print("\nFeatures affecting res_eui (sorted by importance):")
for feature, coefficient in sorted_feature_importance.items():
    print(f'{feature}: {coefficient}')


Features affecting res_eui (sorted by importance):
cat__clim_zone_7: 7291149589.899384
cat__clim_zone_6A: 7291149500.67927
cat__clim_zone_5A: 7291149483.058145
cat__clim_zone_1A: 7291149473.050152
cat__clim_zone_0A: 7291149471.1863575
cat__clim_zone_3A: 7291149432.756974
cat__clim_zone_2A: 7291149429.235195
cat__clim_zone_4A: 7291149424.924022
remainder__fac_glaz_shgc: -20.779385890152607
remainder__infilt_ach: 10.180511458910336
cat__hvac_type_0.0: 5.737593489896991
cat__hvac_type_1.0: -5.737487130118826
remainder__floor_uval: 3.816983844884436
remainder__pv_pe: -3.7699506650868124
cat__floor_finish_0.0: -2.503824107191431
remainder__roop_uval: 2.337979738717343
remainder__fac_glaz_uval: 1.9584975011792722
remainder__lpd: 1.7526194870627188
remainder__epd: 1.6132309276214314
cat__floor_finish_1.0: 1.405026520687813
cat__floor_finish_2.0: 1.0987776807509166
cat__wall_type_0.0: -0.91877364032667
cat__wall_type_2.0: 0.8213032534082229
cat__roof_type_0.0: -0.570305228711857
cat__roof_typ
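
The nearly identical, very large coefficients on every `clim_zone` column are worth a note. One plausible reading is that the one-hot columns of a feature always sum to 1, which duplicates the model's intercept, so the individual coefficients are not uniquely determined and only the differences between them are meaningful. The sketch below illustrates that redundancy on made-up data and shows one common remedy, `OneHotEncoder(drop='first')`; it is an assumption about the cause, not a confirmed diagnosis of the notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical climate-zone column with three categories.
zones = pd.DataFrame({"clim_zone": ["4A", "5A", "7", "4A", "7", "5A"]})


def design_matrix(encoder):
    """Return the encoded columns with an explicit intercept column of ones in front."""
    onehot = encoder.fit_transform(zones).toarray()
    intercept = np.ones((len(zones), 1))
    return np.hstack([intercept, onehot])


full = design_matrix(OneHotEncoder())                # intercept + 3 category columns
reduced = design_matrix(OneHotEncoder(drop="first")) # intercept + 2 category columns

# The full matrix has 4 columns but only rank 3: one column is redundant.
print(full.shape[1], np.linalg.matrix_rank(full))

# Dropping one category removes the redundancy: 3 columns, rank 3.
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

With one category dropped, each remaining coefficient is read relative to the dropped (reference) category.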

(For Gradient Boosting Regression Model)

In [11]:
feature_importances_gbr = model_gbr.named_steps['regressor'].feature_importances_
feature_names_gbr = model_gbr.named_steps['preprocessor'].get_feature_names_out()
feature_importance_gbr = dict(zip(feature_names_gbr, feature_importances_gbr))

sorted_feature_importance_gbr = dict(sorted(feature_importance_gbr.items(), key=lambda item: item[1], reverse=True))

print("\nFeatures affecting res_eui (Gradient Boosting Regressor). Sorted by importance")
for feature, importance in sorted_feature_importance_gbr.items():
    print(f'{feature}: {importance}')


Features affecting res_eui (Gradient Boosting Regressor). Sorted by importance
remainder__pv_pa: 0.5695905170936769
remainder__res_hedc: 0.18472917478490003
cat__clim_zone_7: 0.1194466428518721
remainder__res_cedc: 0.023711789699299095
remainder__tot_floor_area: 0.022852491829502566
cat__clim_zone_6A: 0.01858816820490199
remainder__pv_pe: 0.015908643827879165
cat__clim_zone_4A: 0.012727590462598127
remainder__pv_po: 0.00944859235705199
remainder__epd: 0.004222788575979729
cat__clim_zone_5A: 0.0038962808340912594
cat__floor_finish_0.0: 0.003835188628237253
remainder__floor_uval: 0.0024581858679852808
remainder__lpd: 0.002160910369521605
remainder__fac_glaz_uval: 0.0014637582245279849
cat__hvac_type_0.0: 0.0010748801839394852
remainder__roop_uval: 0.000852680893810155
remainder__infilt_ach: 0.0008018017707504432
remainder__pv_pt: 0.0007026500362272205
cat__hvac_type_1.0: 0.0006427481135194974
cat__roof_type_0.0: 0.00037732459818478703
cat__roof_type_1.0: 0.00018534074496415008
cat__clim
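
Because one-hot encoding spreads a categorical feature over several columns, its importance is split across them. A possible follow-up is to sum the importances back to the original feature. The sketch below does this for a handful of entries copied from the output above (rounded to four decimals); since that output is truncated, the totals here are partial. `original_feature` is an illustrative helper, not part of the notebook.

```python
categorical_features = ['clim_zone', 'build_type', 'wall_type', 'floor_finish',
                        'roof_type', 'infilt_type', 'hvac_type']


def original_feature(name, categorical_features):
    """Map a transformed column name back to the feature it came from."""
    prefix, _, rest = name.partition("__")
    if prefix == "cat":
        for feature in categorical_features:
            if rest.startswith(feature + "_"):
                return feature
    return rest  # passthrough columns keep their original name


# A subset of the importances printed above, rounded to four decimals.
importances = {
    "remainder__pv_pa": 0.5696,
    "remainder__res_hedc": 0.1847,
    "cat__clim_zone_7": 0.1194,
    "cat__clim_zone_6A": 0.0186,
    "cat__clim_zone_4A": 0.0127,
    "cat__clim_zone_5A": 0.0039,
}

grouped = {}
for name, value in importances.items():
    key = original_feature(name, categorical_features)
    grouped[key] = grouped.get(key, 0.0) + value

for feature, total in sorted(grouped.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature}: {total:.4f}")
```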