# Import Libraries

In this part of the code, we import the libraries and modules used to build, train, and evaluate our machine learning models. Think of these libraries as tools in a toolbox, each suited to a specific task. Here is a brief explanation of each import statement:

```python
import pandas as pd
import numpy as np
```

- **pandas (`pd`)**: This library is used for data manipulation and analysis. It allows us to easily read, process, and analyze data stored in various formats, such as CSV files.
- **numpy (`np`)**: This library is used for numerical computing. It provides support for arrays (lists of numbers) and mathematical functions to operate on these arrays.

```python
from sklearn... import ...
```

- **sklearn**: This library is used for machine learning tasks such as classification, regression, clustering, and more. It provides a wide range of tools and algorithms to build and evaluate machine learning models.

```python
from xgboost import XGBRegressor
```

- **XGBRegressor**: This is the scikit-learn-compatible regressor from the XGBoost library, an optimized implementation of gradient boosting. It is known for its high performance and efficiency, especially on larger datasets.

In [14]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

from sklearn.svm import SVR
from xgboost import XGBRegressor

# Reading Dataset

In this part of the code, we read the dataset that contains the data for training and testing our machine learning models. The dataset is stored in a CSV (Comma-Separated Values) file, which is a common format for storing tabular data. We use the `pd.read_csv()` function from the `pandas` library to read the dataset into a DataFrame, which is a two-dimensional data structure that resembles a table. We then call `dropna(how='all')` to drop rows in which every value is missing; rows with only some missing values are kept.

In [2]:
data = pd.read_csv("Template_for_python S04.csv")
data = data.dropna(how='all')
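
As a minimal sketch of what `dropna(how='all')` does, the example below reads a small in-memory CSV instead of the real file. The CSV content and its values are hypothetical and only borrow three column names from the dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content: the second data row is completely empty,
# the third row is missing only one value.
csv_text = "clim_zone,tot_floor_area,res_eui\n1A,1200,95.5\n,,\n4A,,110.2\n"

toy = pd.read_csv(io.StringIO(csv_text))
cleaned = toy.dropna(how='all')  # drops only rows where every column is NaN

print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
print(cleaned)
```

The partially filled row survives, so any remaining missing values still have to be handled before model training if they occur in the feature or target columns.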

# Define Features and Target

In this part of the code, we are defining the features (input variables) and the target (output variable) that will be used to train our machine learning model. The features are the columns in the dataset that will be used to make predictions, while the target is the column that contains the values we want to predict.

In [3]:
features = ['clim_zone', 'build_orient',
            'tot_floor_area', 'fac_glaz_uval', 'fac_glaz_shgc', 'wall_type', 'wall_uval',
            'floor_finish', 'floor_uval', 'roof_type', 'roof_uval', 'infilt_ach',
            'epd', 'lpd', 'hvac_type', 'pv_pe', 'pv_po', 'pv_pt', 'pv_pa',
            'res_hedc', 'res_cedc']
target = 'res_eui'

In [17]:
print(f"Number of Features: {len(features)}")

Number of Features: 21


# Preprocessing Non-Numerical Features

Not all features in the dataset are numerical. Some features may be categorical or text-based, which need to be converted into numerical values before they can be used in a machine learning model. In this part of the code, we are preprocessing non-numerical features by converting them into numerical values using techniques such as one-hot encoding.

## About One-Hot Encoding

One-hot encoding converts a categorical variable into a numerical format that machine learning models can use. For a column with categorical data, such as colors (red, green, blue), it creates one binary (0 or 1) column per category, where each column indicates the presence or absence of that category. This allows the model to learn the relationship between each category and the target variable.

### Example: One-Hot Encoding

Let's say we have a dataset of building types with a column called `build_type`. This column has three categories: "Residential", "Commercial", and "Industrial".

Here’s what the original data might look like:

| build_type  |
|-------------|
| Residential |
| Commercial  |
| Industrial  |
| Residential |
| Commercial  |

Using one-hot encoding, we will convert this single column into three new columns, one for each category. Each column will contain binary values indicating whether the original `build_type` was that category.

| build_type_Residential | build_type_Commercial | build_type_Industrial |
|------------------------|-----------------------|-----------------------|
| 1                      | 0                     | 0                     |
| 0                      | 1                     | 0                     |
| 0                      | 0                     | 1                     |
| 1                      | 0                     | 0                     |
| 0                      | 1                     | 0                     |

In [4]:
categorical_features = ['clim_zone', 'wall_type', 'floor_finish', 'roof_type', 'hvac_type']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ], remainder='passthrough')

# Splitting Data Into 5 Folds

```python
kf = KFold(n_splits=5, shuffle=True, random_state=2024)
```

This line initializes a K-Fold cross-validator from scikit-learn's model selection module. Let's break down the parameters:

- `n_splits=5`: This sets up 5-fold cross-validation. The data will be divided into 5 parts, or "folds", of nearly equal size.
- `shuffle=True`: This parameter tells the cross-validator to shuffle the data before splitting it into folds. This helps to ensure that the order of the data doesn't affect the results.
- `random_state=2024`: This sets a specific random seed for reproducibility. Using the same random state will ensure that the data is shuffled in the same way every time the code is run.

In 5-fold cross-validation:
1. The data is divided into 5 subsets (folds) of nearly equal size.
2. The model is trained on 4 folds and tested on the remaining fold.
3. This process is repeated 5 times, with each fold serving as the test set exactly once.
4. The performance metrics are then averaged across all 5 iterations.

This method gives a more robust estimate of the model's performance than a single train/test split, because every row is used for testing exactly once and for training in the other four iterations.
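
The splitting behaviour can be checked on a tiny example. The 10-row array below is a stand-in for the real data, and `kf_demo` is a placeholder name.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 placeholder rows
kf_demo = KFold(n_splits=5, shuffle=True, random_state=2024)

# Count how often each row lands in a test fold
test_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_toy), start=1):
    test_counts[test_idx] += 1
    print(f"Fold {fold}: train={len(train_idx)} rows, test={len(test_idx)} rows")

print(test_counts)  # every row is tested exactly once
```

With 10 rows and 5 folds, each iteration trains on 8 rows and tests on 2.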

In [5]:
kf = KFold(n_splits=5, shuffle=True, random_state=2024)

# Define Model

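In this experiment, we compare five models side by side: LinearRegression, XGBoost, RandomForest, GradientBoosting, and Support Vector Machine (SVR). Each model is wrapped in a pipeline together with the preprocessor, so the one-hot encoding is fitted on the training folds only. The model that performs best will be used to make predictions on the test set.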

In [6]:
models = {
    'LR': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', LinearRegression())]),
    'XGB': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=2024))]),
    'RFR': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', RandomForestRegressor(n_estimators=100, random_state=2024))]),
    'GBR': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', GradientBoostingRegressor(n_estimators=100, random_state=2024))]),
    'SVR': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', SVR(kernel='rbf', C=1.0, epsilon=0.1))])
}

In [7]:
results = {model: {'r2': [], 'rmse': [], 'mae': []} for model in models}

# Training

In this part of the code, we train and test every model on each of the 5 folds. For each split, a model is fitted on the 4 training folds, where it learns the relationship between the input features and the target variable by adjusting its internal parameters, and then predicts the held-out fold. The R2, RMSE, and MAE of those predictions are stored in `results` so that they can be averaged afterwards.

In [8]:
for train_index, test_index in kf.split(data[features]):
    X_train, X_test = data[features].iloc[train_index], data[features].iloc[test_index]
    y_train, y_test = data[target].iloc[train_index], data[target].iloc[test_index]
    
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        results[name]['r2'].append(r2_score(y_test, y_pred))
        results[name]['rmse'].append(np.sqrt(mean_squared_error(y_test, y_pred)))
        results[name]['mae'].append(np.mean(np.abs(y_test - y_pred)))
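
As an aside, scikit-learn's `cross_validate` can run the same kind of loop for a single estimator. The sketch below is an alternative, not the notebook's method, and it uses synthetic data from `make_regression` rather than the real dataset. The error scorers are negated by scikit-learn so that larger is always better, hence the sign flip.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption: not the building dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0,
                                 random_state=2024)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=2024)
scores = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=cv_demo,
    scoring=['r2', 'neg_root_mean_squared_error', 'neg_mean_absolute_error'])

# Flip the sign of the negated error scores to get RMSE and MAE
rmse_per_fold = -scores['test_neg_root_mean_squared_error']
mae_per_fold = -scores['test_neg_mean_absolute_error']

print(f"R2:   {scores['test_r2'].mean():.4f}")
print(f"RMSE: {rmse_per_fold.mean():.4f}")
print(f"MAE:  {mae_per_fold.mean():.4f}")
```

The explicit loop used in this notebook has the advantage that all five models see exactly the same splits within one pass.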

# Evaluate Model

We use three metrics to evaluate the performance of the model:
* R2 Score: It is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. A score of 1 indicates a perfect fit, and 0 means the model does no better than always predicting the mean of the target. On held-out data the score can also be negative, when the model does worse than that baseline.
* RMSE (Root Mean Squared Error): It is a measure of the differences between values predicted by a model and the values observed. It is the square root of the average of the squared differences between the predicted and actual values.
* MAE (Mean Absolute Error): It is a measure of errors between paired observations expressing the same phenomenon. It is the average of the absolute differences between the predicted and actual values.
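
The three definitions can be written out directly. The sketch below uses made-up numbers and hand-written helper functions (`r2`, `rmse`, `mae` are illustrative names) and compares them with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# Made-up values for illustration
y_true = np.array([100.0, 120.0, 140.0, 160.0])
y_good = np.array([102.0, 118.0, 141.0, 159.0])  # close predictions
y_bad = np.array([160.0, 140.0, 120.0, 100.0])   # reversed predictions

print(r2(y_true, y_good), rmse(y_true, y_good), mae(y_true, y_good))  # R2 = 0.995
print(r2(y_true, y_bad), rmse(y_true, y_bad), mae(y_true, y_bad))     # R2 = -3.0

# The hand-written versions agree with scikit-learn
print(np.isclose(r2(y_true, y_good), r2_score(y_true, y_good)))
print(np.isclose(rmse(y_true, y_good), np.sqrt(mean_squared_error(y_true, y_good))))
```

The second case shows that R2 is not bounded below by 0: predictions that are worse than simply guessing the mean give a negative score.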


In [9]:
for name, metrics in results.items():
    print(f"{name} Results:")
    print(f"Average R-squared: {np.mean(metrics['r2']):.4f} (+/- {np.std(metrics['r2']):.4f})")
    print(f"Average RMSE: {np.mean(metrics['rmse']):.4f} (+/- {np.std(metrics['rmse']):.4f})")
    print(f"Average MAE: {np.mean(metrics['mae']):.4f} (+/- {np.std(metrics['mae']):.4f})")
    print()

LR Results:
Average R-squared: 0.9400 (+/- 0.0128)
Average RMSE: 12.0909 (+/- 1.5077)
Average MAE: 7.6232 (+/- 0.6026)

XGB Results:
Average R-squared: 0.9913 (+/- 0.0028)
Average RMSE: 4.5735 (+/- 0.8450)
Average MAE: 1.9955 (+/- 0.2051)

RFR Results:
Average R-squared: 0.9866 (+/- 0.0025)
Average RMSE: 5.7348 (+/- 0.8360)
Average MAE: 3.0195 (+/- 0.2764)

GBR Results:
Average R-squared: 0.9802 (+/- 0.0033)
Average RMSE: 6.9842 (+/- 1.0328)
Average MAE: 4.3894 (+/- 0.2148)

SVR Results:
Average R-squared: 0.0619 (+/- 0.0438)
Average RMSE: 48.0425 (+/- 3.0118)
Average MAE: 31.6738 (+/- 2.4367)



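Based on these metrics, we can judge how well each model performs and decide what to adjust. Of the five models, XGBoost (XGB) performed best, with the highest R2 score (0.9913) and the lowest RMSE (4.5735) and MAE (1.9955). RandomForest and GradientBoosting follow closely, LinearRegression is clearly behind the tree-based models, and SVR with the settings used here explains almost none of the variance (R2 of 0.0619).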
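
The ranking can be checked directly from the printed averages. The dictionaries below simply copy the mean R2 and RMSE values from the output above; in the notebook itself the same selection could be done on the `results` dictionary.

```python
# Mean scores per model, copied from the printed cross-validation output
mean_r2 = {'LR': 0.9400, 'XGB': 0.9913, 'RFR': 0.9866, 'GBR': 0.9802, 'SVR': 0.0619}
mean_rmse = {'LR': 12.0909, 'XGB': 4.5735, 'RFR': 5.7348, 'GBR': 6.9842, 'SVR': 48.0425}

best_by_r2 = max(mean_r2, key=mean_r2.get)        # higher R2 is better
best_by_rmse = min(mean_rmse, key=mean_rmse.get)  # lower RMSE is better

# Models ordered from lowest to highest RMSE
ranking = sorted(mean_rmse, key=mean_rmse.get)

print(best_by_r2, best_by_rmse)
print(ranking)
```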

# Check Feature Importance

We want to check, from all of the features, which features are the most important in predicting the target variable. This can help us understand which features have the most impact on the target variable and how they contribute to the model's predictions.

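How an importance value is read depends on the model. For LinearRegression, the values are coefficients: a positive coefficient means the feature has a positive impact on the target variable, and a negative coefficient means it has a negative (inverse) impact. For XGBoost, RandomForest, and GradientBoosting, the values are the non-negative `feature_importances_` of the fitted regressor, which rank the features but carry no sign. For SVR, permutation importance is used. Note that at this point each pipeline holds the fit from the last cross-validation fold.

We will sort the features by the absolute value of their importance to identify the most important features in each model.

The feature names come from the preprocessor. One-hot encoded features carry the prefix `cat__`, followed by the feature name and the category. E.g., `cat__hvac_type_Fan Coil Units and Central Plant` means the feature `hvac_type` with category `Fan Coil Units and Central Plant`. Features that are passed through unchanged carry the prefix `remainder__`.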

In [15]:
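def get_feature_importance(model, model_name, X, y):
    # Names of the columns produced by the preprocessor (one-hot + passthrough)
    feature_names = model.named_steps['preprocessor'].get_feature_names_out()

    if model_name == 'LR':
        importances = model.named_steps['regressor'].coef_
    elif model_name in ['XGB', 'RFR', 'GBR']:
        importances = model.named_steps['regressor'].feature_importances_
    elif model_name == 'SVR':
        # For SVR, we use permutation importance. It is computed on the whole
        # pipeline, so it returns one value per raw input column of X and must
        # be paired with X.columns rather than with the encoded feature names.
        perm_importance = permutation_importance(model, X, y, n_repeats=10, random_state=2024)
        importances = perm_importance.importances_mean
        feature_names = X.columns
    else:
        return None

    feature_importance = dict(zip(feature_names, importances))
    # Sort by absolute value so that large negative coefficients rank high too
    return dict(sorted(feature_importance.items(), key=lambda item: abs(item[1]), reverse=True))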

In [16]:
X = data[features]
y = data[target]

for name, model in models.items():
    print(f"\nFeature Importance for {name}:")
    importance = get_feature_importance(model, name, X, y)
    if importance:
        for feature, value in importance.items():
            print(f"{feature}: {value:.4f}")
    print()


Feature Importance for LR:
cat__clim_zone_7: 4528797652.4397
cat__clim_zone_0A: 4528797629.5039
cat__clim_zone_1A: 4528797625.9465
cat__clim_zone_6A: 4528797591.2245
cat__clim_zone_5A: 4528797585.3367
cat__clim_zone_2A: 4528797575.6369
cat__clim_zone_3A: 4528797563.1239
cat__clim_zone_4A: 4528797545.9550
remainder__infilt_ach: 13.1415
cat__hvac_type_Fan Coil Units and Central Plant: 6.2366
cat__hvac_type_VRF Fan Coils: -6.2366
remainder__floor_uval: 4.2216
remainder__pv_pe: -3.6500
cat__floor_finish_Tiles: -3.1313
cat__floor_finish_Hardwood: 2.7539
remainder__roof_uval: 2.4075
remainder__fac_glaz_uval: 2.1727
remainder__lpd: 2.1576
remainder__epd: 2.0070
remainder__wall_uval: 0.7150
cat__wall_type_Brick Plaster: 0.6418
cat__wall_type_Precast Concrete: -0.6290
cat__roof_type_Concrete: -0.5872
cat__roof_type_Metal Deck: 0.5872
cat__floor_finish_Carpet: 0.3773
remainder__res_hedc: 0.1062
remainder__fac_glaz_shgc: -0.0531
remainder__pv_pt: 0.0420
remainder__pv_po: -0.0341
cat__wall_type_C
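
The `clim_zone` coefficients of about 4.5e9 in the LR output are most likely a numerical artifact rather than a real effect. When every category gets its own one-hot column and the model also fits an intercept, the columns are perfectly collinear, so a constant can be shifted between them without changing the predictions. Only the differences between those coefficients are meaningful. The sketch below illustrates the collinearity on synthetic data. It uses `pd.get_dummies` with `drop_first=True` as one possible remedy, which is an assumption and not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: three hypothetical climate zones with different mean targets
rng = np.random.default_rng(2024)
zones = pd.Series(rng.choice(['1A', '2A', '3A'], size=60), name='clim_zone')
y_demo = zones.map({'1A': 10.0, '2A': 20.0, '3A': 30.0}) + rng.normal(scale=1.0, size=60)

full = pd.get_dummies(zones).astype(float)                      # one column per zone
reduced = pd.get_dummies(zones, drop_first=True).astype(float)  # reference zone dropped

# Every row of the full encoding sums to 1, which duplicates the intercept
print(full.sum(axis=1).unique())

lr_full = LinearRegression().fit(full, y_demo)
lr_reduced = LinearRegression().fit(reduced, y_demo)

# Both encodings give the same predictions; only the coefficients differ
print(np.allclose(lr_full.predict(full), lr_reduced.predict(reduced)))
print(lr_reduced.coef_)  # differences relative to the dropped zone '1A'
```

With a reference category dropped, each remaining coefficient reads as the difference from that reference zone, which is easier to interpret than the raw values.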

## Saving the Importance Value as CSV

In [18]:
def get_feature_importance(model, model_name, X, y):
    # Names of the columns produced by the preprocessor (one-hot + passthrough)
    feature_names = model.named_steps['preprocessor'].get_feature_names_out()

    if model_name == 'LR':
        importances = model.named_steps['regressor'].coef_
    elif model_name in ['XGB', 'RFR', 'GBR']:
        importances = model.named_steps['regressor'].feature_importances_
    elif model_name == 'SVR':
        # For SVR, we use permutation importance. It is computed on the whole
        # pipeline, so it returns one value per raw input column of X and must
        # be paired with X.columns rather than with the encoded feature names.
        perm_importance = permutation_importance(model, X, y, n_repeats=10, random_state=2024)
        importances = perm_importance.importances_mean
        feature_names = X.columns
    else:
        return None

    return dict(zip(feature_names, importances))

# Calculate feature importances for all models
X = data[features]
y = data[target]

feature_importances = {}
for name, model in models.items():
    feature_importances[name] = get_feature_importance(model, name, X, y)

# Create a DataFrame from the feature importances
df_importance = pd.DataFrame(feature_importances)

# Sort the DataFrame by the average importance across all models
df_importance['avg_importance'] = df_importance.mean(axis=1)
df_importance = df_importance.sort_values('avg_importance', ascending=False)
df_importance = df_importance.drop('avg_importance', axis=1)

# Drop the 'cat__' transformer prefix so that the index shows the feature
# name and its category label
new_index = []
for feature in df_importance.index:
    if feature.startswith('cat__'):
        # e.g. 'cat__hvac_type_VRF Fan Coils' -> 'hvac_type_VRF Fan Coils'
        new_index.append(feature.split('__', 1)[1])
    else:
        new_index.append(feature)
df_importance.index = new_index

# Save the DataFrame to a CSV file
df_importance.to_csv('feature_importance_comparison.csv')

print("Feature importance comparison has been saved to 'feature_importance_comparison.csv'")

Feature importance comparison has been saved to 'feature_importance_comparison.csv'
