<a href="https://colab.research.google.com/github/Kakumanu-Harshitha/Energy-Efficient-in-Buildings/blob/main/ENB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Energy Efficiency in Buildings**
The ENB dataset is used to predict the energy demand of buildings, expressed as heating load (`Y1`) and cooling load (`Y2`). It is a useful resource for studying building energy efficiency because it describes each building through eight geometric and glazing features (`X1` to `X8`), such as orientation, wall and roof area, overall height, and glazing area.






# Steps to build the model

 1. Import libraries

 2. Import Dataset

 3. Data Exploration

 4. Data preprocessing

    * Finding and filling missing values
    * Feature and Target Variable Separation
    * Normalization
    * Splitting into train and test data

 5. Algorithm Evaluation and Comparison

    a. Support Vector Regression

    b. Linear Regression

    c. Decision Tree Regression

    d. Random Forest Regression

    e. Gradient Boosting Regression

    f. Polynomial Regression

 6. Model Performance Metrics Overview

 7. Best Algorithm Selection


    



# 1.Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# 2.Import Dataset
Dataset source (Kaggle): https://www.kaggle.com/datasets/ahmettademir/enb2012



In [None]:
df = pd.read_csv("ENB2012_data.csv")
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.9,563.5,318.5,122.5,7.0,2,0.0,0,20.84,28.28


In [None]:
df.shape

(768, 10)

`df.shape` returns the dimensions of the dataset: 768 rows and 10 columns.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      768 non-null    float64
 1   X2      768 non-null    float64
 2   X3      768 non-null    float64
 3   X4      768 non-null    float64
 4   X5      768 non-null    float64
 5   X6      768 non-null    int64  
 6   X7      768 non-null    float64
 7   X8      768 non-null    int64  
 8   Y1      768 non-null    float64
 9   Y2      768 non-null    float64
dtypes: float64(8), int64(2)
memory usage: 60.1 KB


`df.info()` summarizes the structure of the dataset: column names, data types, and non-null counts, which reveal any missing values.
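The columns are coded `X1` to `X8` and `Y1`, `Y2`. As a minimal sketch, assuming the attribute descriptions published with the UCI Energy Efficiency dataset (they are not stated in this notebook), the codes can be mapped to readable names. The small frame below reuses two rows from the `df.head()` output so the snippet runs without the CSV file; `COLUMN_NAMES`, `sample`, and `readable` are illustrative names.

```python
import pandas as pd

# Assumed mapping, based on the UCI Energy Efficiency dataset description.
COLUMN_NAMES = {
    "X1": "relative_compactness",
    "X2": "surface_area",
    "X3": "wall_area",
    "X4": "roof_area",
    "X5": "overall_height",
    "X6": "orientation",
    "X7": "glazing_area",
    "X8": "glazing_area_distribution",
    "Y1": "heating_load",
    "Y2": "cooling_load",
}

# Two rows copied from the df.head() output above.
sample = pd.DataFrame(
    [
        [0.98, 514.5, 294.0, 110.25, 7.0, 2, 0.0, 0, 15.55, 21.33],
        [0.90, 563.5, 318.5, 122.5, 7.0, 2, 0.0, 0, 20.84, 28.28],
    ],
    columns=list(COLUMN_NAMES),
)

# rename() returns a new frame; the coded names in `sample` are left untouched.
readable = sample.rename(columns=COLUMN_NAMES)
print(readable.columns.tolist())
```

The rest of the notebook keeps the coded names, so this mapping is only a reading aid.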

In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
X1,768.0,0.764167,0.105777,0.62,0.6825,0.75,0.83,0.98
X2,768.0,671.708333,88.086116,514.5,606.375,673.75,741.125,808.5
X3,768.0,318.5,43.626481,245.0,294.0,318.5,343.0,416.5
X4,768.0,176.604167,45.16595,110.25,140.875,183.75,220.5,220.5
X5,768.0,5.25,1.75114,3.5,3.5,5.25,7.0,7.0
X6,768.0,3.5,1.118763,2.0,2.75,3.5,4.25,5.0
X7,768.0,0.234375,0.133221,0.0,0.1,0.25,0.4,0.4
X8,768.0,2.8125,1.55096,0.0,1.75,3.0,4.0,5.0
Y1,768.0,22.307201,10.090196,6.01,12.9925,18.95,31.6675,43.1
Y2,768.0,24.58776,9.513306,10.9,15.62,22.08,33.1325,48.03


`df.describe()` gives a statistical overview of the data, including central tendency (mean, median), spread (standard deviation, quartiles), and the range of each column.

# 4.Data Preprocessing

# Finding Missing values

In [None]:
df.isnull().sum()

Unnamed: 0,0
X1,0
X2,0
X3,0
X4,0
X5,0
X6,0
X7,0
X8,0
Y1,0
Y2,0


There are no missing values, so nothing needs to be filled in.

In [None]:
df.duplicated().sum()

0

`df.duplicated().sum()` counts duplicate rows, which can distort data analysis and model performance. The result is 0, so the dataset contains no duplicates.


# Feature and Target Variable Separation


In [None]:
x=df.drop(['Y1','Y2'],axis=1)
y1=df['Y1']
y2=df['Y2']
print(x)
print(y1)
print(y2)

       X1     X2     X3      X4   X5  X6   X7  X8
0    0.98  514.5  294.0  110.25  7.0   2  0.0   0
1    0.98  514.5  294.0  110.25  7.0   3  0.0   0
2    0.98  514.5  294.0  110.25  7.0   4  0.0   0
3    0.98  514.5  294.0  110.25  7.0   5  0.0   0
4    0.90  563.5  318.5  122.50  7.0   2  0.0   0
..    ...    ...    ...     ...  ...  ..  ...  ..
763  0.64  784.0  343.0  220.50  3.5   5  0.4   5
764  0.62  808.5  367.5  220.50  3.5   2  0.4   5
765  0.62  808.5  367.5  220.50  3.5   3  0.4   5
766  0.62  808.5  367.5  220.50  3.5   4  0.4   5
767  0.62  808.5  367.5  220.50  3.5   5  0.4   5

[768 rows x 8 columns]
0      15.55
1      15.55
2      15.55
3      15.55
4      20.84
       ...  
763    17.88
764    16.54
765    16.44
766    16.48
767    16.64
Name: Y1, Length: 768, dtype: float64
0      21.33
1      21.33
2      21.33
3      21.33
4      28.28
       ...  
763    21.40
764    16.88
765    17.11
766    16.61
767    16.03
Name: Y2, Length: 768, dtype: float64


Let's separate the feature variables from the target variables. The input variables are stored in `x`. Because there are two targets to predict, they are kept separately as `y1` and `y2`.

# Normalization

In [None]:
from sklearn.preprocessing import  MinMaxScaler
m_x = MinMaxScaler()
m_y1 = MinMaxScaler()
m_y2 = MinMaxScaler()
x_scaled = m_x.fit_transform(x)
y1_scaled = m_y1.fit_transform(y1.values.reshape(-1, 1))
y2_scaled = m_y2.fit_transform(y2.values.reshape(-1, 1))



Normalization is a preprocessing technique that scales the features (or variables) of a dataset to a similar range, usually between 0 and 1. This mainly helps scale-sensitive models such as Support Vector Regression; tree-based models are largely unaffected by it.
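As a small illustration of what `MinMaxScaler` computes, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand to four `X2` values taken from the dataset preview and compares the result with the scaler's output. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Four X2 values from the dataset preview, as a single column.
col = np.array([[514.5], [563.5], [784.0], [808.5]])

# Manual min-max scaling: (x - min) / (max - min).
manual = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))

# MinMaxScaler applies the same formula column by column.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(col)
print(np.allclose(manual, scaled))

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, col))
```

Keeping the fitted scaler objects around, as the notebook does with `m_y1` and `m_y2`, makes it possible to convert predictions back to the original scale later.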

# Splitting into train and test data

Now we split the dataset into training and test data. The test set is 20% of the dataset and the remaining 80% is used for training.
We fit each model on the training data and evaluate it on the test data. This measures performance on unseen data and helps to detect overfitting.

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train_1,y_test_1=train_test_split(x_scaled,y1_scaled,test_size=0.2,random_state=42)
x_train,x_test,y_train_2,y_test_2=train_test_split(x_scaled,y2_scaled,test_size=0.2,random_state=42)

Let's check the shapes of the resulting arrays.

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train_1.shape)
print(y_test_1.shape)
print(x_train.shape)
print(x_test.shape)
print(y_train_2.shape)
print(y_test_2.shape)


(614, 8)
(154, 8)
(614, 1)
(154, 1)
(614, 8)
(154, 8)
(614, 1)
(154, 1)
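The notebook fits the scalers on the full dataset before splitting, so the test rows influence the minimum and maximum used for scaling. A common alternative, sketched here on synthetic data rather than the ENB file, is to split first and fit the scaler on the training rows only. All names in the snippet (`x_raw`, `y_raw`, `x_scaler`, and so on) are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and one target (not the ENB data).
rng = np.random.default_rng(42)
x_raw = rng.uniform(0.0, 100.0, size=(768, 8))
y_raw = x_raw.sum(axis=1) + rng.normal(0.0, 1.0, size=768)

# Split first, so the test rows stay unseen during preprocessing.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_raw, y_raw, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
x_scaler = MinMaxScaler()
x_tr_scaled = x_scaler.fit_transform(x_tr)
x_te_scaled = x_scaler.transform(x_te)

print(x_tr_scaled.shape, x_te_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside the 0 to 1 range, which is expected because the test rows did not contribute to the fitted minimum and maximum.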


# 5.Algorithm Evaluation and Comparison

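Evaluating model performance using regression metrics. Because the targets were min-max scaled, the MAE, MSE, and RMSE values reported below are in scaled units (0 to 1 range), not in the original units of `Y1` and `Y2`.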

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
def evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2, model_name="Model"):
    mae_1 = mean_absolute_error(y_test_1, y_pred_1)
    mse_1 = mean_squared_error(y_test_1, y_pred_1)
    rmse_1 = np.sqrt(mse_1)
    r2_1 = r2_score(y_test_1, y_pred_1)
    mae_2 = mean_absolute_error(y_test_2, y_pred_2)
    mse_2 = mean_squared_error(y_test_2, y_pred_2)
    rmse_2 = np.sqrt(mse_2)
    r2_2 = r2_score(y_test_2, y_pred_2)
    print(f"\n{model_name} Performance Metrics:")
    print(f"Mean Absolute Error (MAE): {mae_1}")
    print(f"Mean Squared Error (MSE): {mse_1}")
    print(f"Root Mean Squared Error (RMSE): {rmse_1}")
    print(f"R-squared (R2): {r2_1}")
    print(f"\n{model_name} Performance Metrics:")
    print(f"Mean Absolute Error (MAE): {mae_2}")
    print(f"Mean Squared Error (MSE): {mse_2}")
    print(f"Root Mean Squared Error (RMSE): {rmse_2}")
    print(f"R-squared (R2): {r2_2}")
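To make the four metrics concrete, here is a minimal sketch that computes them by hand with NumPy on toy values (not ENB data) and checks them against the scikit-learn functions used in `evaluate_regression`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                   # average absolute error
mse = np.mean(errors ** 2)                      # average squared error
rmse = np.sqrt(mse)                             # same units as the target
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # share of variance explained

# The manual values agree with the library functions.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

MAE and RMSE are expressed in the units of the target, MSE in squared units, and R2 is unitless, which is why R2 is the easiest value to compare across the two targets.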

# ***a.Support Vector Regressor***

In [None]:
from sklearn.svm import SVR
svr_1=SVR()
svr_1.fit(x_train,y_train_1)
y_pred_1=svr_1.predict(x_test)

svr_2=SVR()
svr_2.fit(x_train,y_train_2)
y_pred_2=svr_2.predict(x_test)
evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2,"SVR")


SVR Performance Metrics:
Mean Absolute Error (MAE): 0.0623122925390864
Mean Squared Error (MSE): 0.005126974858004193
Root Mean Squared Error (RMSE): 0.07160289699449453
R-squared (R2): 0.9323331968275725

SVR Performance Metrics:
Mean Absolute Error (MAE): 0.06454294564909405
Mean Squared Error (MSE): 0.006272713513616545
Root Mean Squared Error (RMSE): 0.07920046409975477
R-squared (R2): 0.9066689812221665


# ***b.Linear Regression***

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train_1)
y_pred_1 = model.predict(x_test)
model = LinearRegression()
model.fit(x_train, y_train_2)
y_pred_2 = model.predict(x_test)
evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2,"Linear Regression")



Linear Regression Performance Metrics:
Mean Absolute Error (MAE): 0.05883114106572995
Mean Squared Error (MSE): 0.006653630955899004
Root Mean Squared Error (RMSE): 0.08156979193242438
R-squared (R2): 0.9121840951546909

Linear Regression Performance Metrics:
Mean Absolute Error (MAE): 0.05912456801820292
Mean Squared Error (MSE): 0.007176238825425381
Root Mean Squared Error (RMSE): 0.08471268397014335
R-squared (R2): 0.8932255268607286


# ***c.Decision Tree Regressor***

In [None]:
from sklearn.tree import DecisionTreeRegressor
decision_tree_model = DecisionTreeRegressor(random_state=42)
decision_tree_model.fit(x_train, y_train_1)
y_pred_1= decision_tree_model.predict(x_test)
decision_tree_model = DecisionTreeRegressor(random_state=42)
decision_tree_model.fit(x_train, y_train_2)
y_pred_2 = decision_tree_model.predict(x_test)
evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2,"Decision Tree Regressor")



Decision Tree Regressor Performance Metrics:
Mean Absolute Error (MAE): 0.011609178096101796
Mean Squared Error (MSE): 0.00028611316439390675
Root Mean Squared Error (RMSE): 0.016914879969834453
R-squared (R2): 0.9962238232649303

Decision Tree Regressor Performance Metrics:
Mean Absolute Error (MAE): 0.029954424783404054
Mean Squared Error (MSE): 0.00286291914428373
Root Mean Squared Error (RMSE): 0.053506253319436696
R-squared (R2): 0.9574029389618163


# ***d.Random Forest Regression***

In [None]:
from sklearn.ensemble import RandomForestRegressor
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(x_train, y_train_1)
y_pred_1 = random_forest_model.predict(x_test)
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(x_train, y_train_2)
y_pred_2 = random_forest_model.predict(x_test)
evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2,"Random Forest Regressor")



Random Forest Regressor Performance Metrics:
Mean Absolute Error (MAE): 0.009656206559684563
Mean Squared Error (MSE): 0.00017862449133652984
Root Mean Squared Error (RMSE): 0.013365047375020032
R-squared (R2): 0.9976424795065703

Random Forest Regressor Performance Metrics:
Mean Absolute Error (MAE): 0.02880325707150378
Mean Squared Error (MSE): 0.002179824306375173
Root Mean Squared Error (RMSE): 0.04668858860980028
R-squared (R2): 0.9675666323945692


# ***e.Gradient Boosting Regression***

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gradient_boosting_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gradient_boosting_model.fit(x_train, y_train_1)
y_pred_1 = gradient_boosting_model.predict(x_test)
gradient_boosting_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gradient_boosting_model.fit(x_train, y_train_2)
y_pred_2 = gradient_boosting_model.predict(x_test)
evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2,"Gradient Boosting Regressor")


Gradient Boosting Regressor Performance Metrics:
Mean Absolute Error (MAE): 0.010411413117876495
Mean Squared Error (MSE): 0.00019286047533180969
Root Mean Squared Error (RMSE): 0.013887421478871075
R-squared (R2): 0.9974545902436708

Gradient Boosting Regressor Performance Metrics:
Mean Absolute Error (MAE): 0.028475210883911525
Mean Squared Error (MSE): 0.0016635887803879968
Root Mean Squared Error (RMSE): 0.04078711537223485
R-squared (R2): 0.9752476443625325


# ***f.Polynomial Regression***

Here we use **degree=2**.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
degree=2
poly_features = PolynomialFeatures(degree=degree)
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)
linear_model = LinearRegression()
linear_model.fit(x_train_poly, y_train_1)
y_pred_1 = linear_model.predict(x_test_poly)
linear_model = LinearRegression()
linear_model.fit(x_train_poly, y_train_2)
y_pred_2 = linear_model.predict(x_test_poly)

In [None]:
evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2,"Polynomial Regression_deg=2")


Polynomial Regression_deg=2 Performance Metrics:
Mean Absolute Error (MAE): 0.01629010910129377
Mean Squared Error (MSE): 0.000468673337440852
Root Mean Squared Error (RMSE): 0.021648864576250922
R-squared (R2): 0.9938143588850912

Polynomial Regression_deg=2 Performance Metrics:
Mean Absolute Error (MAE): 0.03207287971579901
Mean Squared Error (MSE): 0.002160853221150482
Root Mean Squared Error (RMSE): 0.04648497844627318
R-squared (R2): 0.967848901098137


Here we use **degree=3**.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
degree=3
poly_features = PolynomialFeatures(degree=degree)
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)
linear_model = LinearRegression()
linear_model.fit(x_train_poly, y_train_1)
y_pred_1 = linear_model.predict(x_test_poly)
linear_model = LinearRegression()
linear_model.fit(x_train_poly, y_train_2)
y_pred_2 = linear_model.predict(x_test_poly)

In [None]:
evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2,"Polynomial Regression_deg=3")


Polynomial Regression_deg=3 Performance Metrics:
Mean Absolute Error (MAE): 0.011074886285206366
Mean Squared Error (MSE): 0.00021333372444399837
Root Mean Squared Error (RMSE): 0.014605948255556651
R-squared (R2): 0.9971843803525863

Polynomial Regression_deg=3 Performance Metrics:
Mean Absolute Error (MAE): 0.03217514714919718
Mean Squared Error (MSE): 0.0020143539396450007
Root Mean Squared Error (RMSE): 0.044881554559139336
R-squared (R2): 0.9700286478956668


Here we use **degree=4**.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
degree=4
poly_features = PolynomialFeatures(degree=degree)
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)
linear_model = LinearRegression()
linear_model.fit(x_train_poly, y_train_1)
y_pred_1 = linear_model.predict(x_test_poly)
linear_model = LinearRegression()
linear_model.fit(x_train_poly, y_train_2)
y_pred_2 = linear_model.predict(x_test_poly)

In [None]:
evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2,"Polynomial Regression_deg=4")


Polynomial Regression_deg=4 Performance Metrics:
Mean Absolute Error (MAE): 0.0091752666492968
Mean Squared Error (MSE): 0.00014934332155129936
Root Mean Squared Error (RMSE): 0.012220610522854386
R-squared (R2): 0.9980289380337507

Polynomial Regression_deg=4 Performance Metrics:
Mean Absolute Error (MAE): 0.025553081726663648
Mean Squared Error (MSE): 0.0015554118319813202
Root Mean Squared Error (RMSE): 0.03943870981638877
R-squared (R2): 0.9768571973544165


Here we use **degree=5**.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
degree=5
poly_features = PolynomialFeatures(degree=degree)
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)
linear_model = LinearRegression()
linear_model.fit(x_train_poly, y_train_1)
y_pred_1 = linear_model.predict(x_test_poly)
linear_model = LinearRegression()
linear_model.fit(x_train_poly, y_train_2)
y_pred_2 = linear_model.predict(x_test_poly)

In [None]:
evaluate_regression(y_test_1, y_pred_1,y_test_2,y_pred_2,"Polynomial Regression_deg=5")



Polynomial Regression_deg=5 Performance Metrics:
Mean Absolute Error (MAE): 0.00949921557934359
Mean Squared Error (MSE): 0.00017381726839564075
Root Mean Squared Error (RMSE): 0.013183977715228464
R-squared (R2): 0.9977059261622603

Polynomial Regression_deg=5 Performance Metrics:
Mean Absolute Error (MAE): 0.030493040475114298
Mean Squared Error (MSE): 0.0019307801193079227
Root Mean Squared Error (RMSE): 0.043940643137167695
R-squared (R2): 0.9712721336340611
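One thing worth keeping in mind when comparing degrees is how quickly the number of generated features grows. The sketch below counts the columns that `PolynomialFeatures` produces for the 8 input features at each degree used above; `dummy` and `feature_counts` are illustrative names.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

n_features = 8  # X1..X8
dummy = np.zeros((1, n_features))  # only the shape matters here

feature_counts = {}
for degree in (2, 3, 4, 5):
    poly = PolynomialFeatures(degree=degree).fit(dummy)
    feature_counts[degree] = poly.n_output_features_
    # With the bias column included, the count is C(n_features + degree, degree).
    assert feature_counts[degree] == comb(n_features + degree, degree)

print(feature_counts)
# {2: 45, 3: 165, 4: 495, 5: 1287}
```

At degree 5 the expansion has 1287 columns, which is more than the 614 training rows, so the linear model has more coefficients than observations. This is one reason to treat the higher-degree results with some caution.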


# 6.Model Performance Metrics Overview

Let's create a table that contains all the metric values. This makes it easier to find the best model. In the model names, the suffix `_1` refers to `Y1` and `_2` to `Y2`.

In [None]:
from tabulate import tabulate
#  the models and their corresponding metrics
data = {
    "Model": [
        "SVR_1", "SVR_2", "Linear Regression_1", "Linear Regression_2",
        "Decision Tree Regressor_1", "Decision Tree Regressor_2",
        "Random Forest Regressor_1", "Random Forest Regressor_2",
        "Gradient Boosting Regressor_1", "Gradient Boosting Regressor_2",
        "Polynomial Regression_deg=2_1", "Polynomial Regression_deg=2_2",
        "Polynomial Regression_deg=3_1", "Polynomial Regression_deg=3_2",
        "Polynomial Regression_deg=4_1", "Polynomial Regression_deg=4_2",
        "Polynomial Regression_deg=5_1", "Polynomial Regression_deg=5_2"
    ],
    "MAE": [
        0.0623122925390864, 0.06454294564909405, 0.05883114106572995, 0.05912456801820292,
        0.011609178096101796, 0.029954424783404054, 0.009656206559684563, 0.02880325707150378,
        0.010411413117876495, 0.028475210883911525, 0.01629010910129377, 0.03207287971579901,
        0.011074886285206366, 0.03217514714919718, 0.0091752666492968, 0.025553081726663648,
        0.00949921557934359, 0.030493040475114298
    ],
    "MSE": [
        0.005126974858004193, 0.006272713513616545, 0.006653630955899004, 0.007176238825425381,
        0.00028611316439390675, 0.00286291914428373, 0.00017862449133652984, 0.002179824306375173,
        0.00019286047533180969, 0.0016635887803879968, 0.000468673337440852, 0.002160853221150482,
        0.00021333372444399837, 0.0020143539396450007, 0.00014934332155129936, 0.0015554118319813202,
        0.00017381726839564075, 0.0019307801193079227
    ],
    "RMSE": [
        0.07160289699449453, 0.07920046409975477, 0.08156979193242438, 0.08471268397014335,
        0.016914879969834453, 0.053506253319436696, 0.013365047375020032, 0.04668858860980028,
        0.013887421478871075, 0.04078711537223485, 0.021648864576250922, 0.04648497844627318,
        0.014605948255556651, 0.044881554559139336, 0.012220610522854386, 0.03943870981638877,
        0.013183977715228464, 0.043940643137167695
    ],
    "R-squared (R2)": [
        0.9323331968275725, 0.9066689812221665, 0.9121840951546909, 0.8932255268607286,
        0.9962238232649303, 0.9574029389618163, 0.9976424795065703, 0.9675666323945692,
        0.9974545902436708, 0.9752476443625325, 0.9938143588850912, 0.967848901098137,
        0.9971843803525863, 0.9700286478956668, 0.9980289380337507, 0.9768571973544165,
        0.9977059261622603, 0.9712721336340611
    ]
}
# Use a separate name so the original dataset in `df` is not overwritten.
results_df = pd.DataFrame(data)
print(tabulate(results_df, headers='keys', tablefmt='fancy_grid', showindex=False))


╒═══════════════════════════════╤════════════╤═════════════╤═══════════╤══════════════════╕
│ Model                         │        MAE │         MSE │      RMSE │   R-squared (R2) │
╞═══════════════════════════════╪════════════╪═════════════╪═══════════╪══════════════════╡
│ SVR_1                         │ 0.0623123  │ 0.00512697  │ 0.0716029 │         0.932333 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ SVR_2                         │ 0.0645429  │ 0.00627271  │ 0.0792005 │         0.906669 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Linear Regression_1           │ 0.0588311  │ 0.00665363  │ 0.0815698 │         0.912184 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Linear Regression_2           │ 0.0591246  │ 0.00717624  │ 0.0847127 │         0.893226 │
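├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Decision Tree Regressor_1     │ 0.0116092  │ 0.000286113 │ 0.0169149 │         0.996224 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Decision Tree Regressor_2     │ 0.0299544  │ 0.00286292  │ 0.0535063 │         0.957403 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Random Forest Regressor_1     │ 0.00965621 │ 0.000178624 │ 0.013365  │         0.997642 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Random Forest Regressor_2     │ 0.0288033  │ 0.00217982  │ 0.0466886 │         0.967567 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Gradient Boosting Regressor_1 │ 0.0104114  │ 0.00019286  │ 0.0138874 │         0.997455 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Gradient Boosting Regressor_2 │ 0.0284752  │ 0.00166359  │ 0.0407871 │         0.975248 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Polynomial Regression_deg=2_1 │ 0.0162901  │ 0.000468673 │ 0.0216489 │         0.993814 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Polynomial Regression_deg=2_2 │ 0.0320729  │ 0.00216085  │ 0.046485  │         0.967849 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Polynomial Regression_deg=3_1 │ 0.0110749  │ 0.000213334 │ 0.0146059 │         0.997184 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Polynomial Regression_deg=3_2 │ 0.0321751  │ 0.00201435  │ 0.0448816 │         0.970029 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Polynomial Regression_deg=4_1 │ 0.00917527 │ 0.000149343 │ 0.0122206 │         0.998029 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Polynomial Regression_deg=4_2 │ 0.0255531  │ 0.00155541  │ 0.0394387 │         0.976857 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Polynomial Regression_deg=5_1 │ 0.00949922 │ 0.000173817 │ 0.013184  │         0.997706 │
├───────────────────────────────┼────────────┼─────────────┼───────────┼──────────────────┤
│ Polynomial Regression_deg=5_2 │ 0.030493   │ 0.00193078  │ 0.0439406 │         0.971272 │
╘═══════════════════════════════╧════════════╧═════════════╧═══════════╧══════════════════╛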

From this table we find that polynomial regression with degree 4 is the best model: it has the lowest MAE, MSE, and RMSE and the highest R2 for both targets. Random Forest Regression and Gradient Boosting Regression have similar values.
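Copying metric values into a dictionary by hand is error-prone. As a hedged alternative, the table could be assembled programmatically by looping over the models and both targets. The sketch below does this on synthetic two-target data (not the ENB file) with only two models to stay short; the `models` registry and the `rows` and `results` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with two targets (not the ENB data).
x_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_targets=2, noise=5.0, random_state=42
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical model registry; any scikit-learn regressor could be added.
models = {
    "Linear Regression": lambda: LinearRegression(),
    "Random Forest Regressor": lambda: RandomForestRegressor(
        n_estimators=50, random_state=42
    ),
}

rows = []
for name, make_model in models.items():
    for target in (0, 1):  # column 0 plays the role of Y1, column 1 of Y2
        model = make_model().fit(x_tr, y_tr[:, target])
        pred = model.predict(x_te)
        mse = mean_squared_error(y_te[:, target], pred)
        rows.append({
            "Model": f"{name}_{target + 1}",
            "MAE": mean_absolute_error(y_te[:, target], pred),
            "MSE": mse,
            "RMSE": np.sqrt(mse),
            "R-squared (R2)": r2_score(y_te[:, target], pred),
        })

# Sort so the model with the lowest RMSE appears first.
results = pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)
print(results)
```

Sorting by RMSE puts the strongest candidate at the top, and rerunning the notebook updates the table automatically.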





# 7.Best Algorithm Selection

The best model for the Energy Efficiency in Buildings dataset is **Polynomial Regression of Degree 4**.

The evaluation metrics of Polynomial Regression of Degree 4, computed on the min-max-scaled targets, are:

Polynomial Regression_deg=4 Performance Metrics for heating load (Y1):

Mean Absolute Error (MAE): 0.0091752666492968

Mean Squared Error (MSE): 0.00014934332155129936

Root Mean Squared Error (RMSE): 0.012220610522854386

R-squared (R2): 0.9980289380337507

Polynomial Regression_deg=4 Performance Metrics for cooling load (Y2):

Mean Absolute Error (MAE): 0.025553081726663648

Mean Squared Error (MSE): 0.0015554118319813202

Root Mean Squared Error (RMSE): 0.03943870981638877

R-squared (R2): 0.9768571973544165
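These errors are measured on targets scaled to the 0 to 1 range. Because `MinMaxScaler` divides by `max - min`, MAE and RMSE can be converted back to the original units of `Y1` and `Y2` by multiplying with that range. The sketch below does this arithmetic with the minimum and maximum values from the `df.describe()` output; the dictionary names are illustrative.

```python
# Target ranges from the df.describe() output (max - min of Y1 and Y2).
target_range = {"Y1": 43.1 - 6.01, "Y2": 48.03 - 10.9}

# Scaled degree-4 metrics copied from the results above.
scaled_metrics = {
    "Y1": {"MAE": 0.0091752666492968, "RMSE": 0.012220610522854386},
    "Y2": {"MAE": 0.025553081726663648, "RMSE": 0.03943870981638877},
}

# MinMaxScaler divides by (max - min), so MAE and RMSE scale back linearly.
original_units = {
    target: {name: value * target_range[target] for name, value in metrics.items()}
    for target, metrics in scaled_metrics.items()
}

for target, metrics in original_units.items():
    print(target, {name: round(value, 3) for name, value in metrics.items()})
```

MSE would scale with the square of the range, while R2 is unaffected by this linear rescaling of the target.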



# Finally, we have built the best model. ⚡