In [1]:
import pandas as pd
import numpy as np
import plotly.express as plx
from plotly.subplots import make_subplots
import plotly.graph_objects as go

### We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ in glazing area, glazing area distribution, and orientation, among other parameters.
### We simulate various settings as functions of the aforementioned characteristics to obtain 768 building shapes.
### The dataset comprises 768 samples and 8 features, with the aim of predicting two real-valued responses.
### It can also be used as a multi-class classification problem if the response is rounded to the nearest integer.

### The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses.

### X1	Relative Compactness
### X2	Surface Area
### X3	Wall Area
### X4	Roof Area
### X5	Overall Height
### X6	Orientation
### X7	Glazing Area
### X8	Glazing Area Distribution
### y1	Heating Load
### y2	Cooling Load

In [2]:
df = pd.read_excel(r'C:\Users\harip\INEURON_PROJECTS\Energy Efficiency\energy+efficiency\ENB2012_data.xlsx')
df

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.90,563.5,318.5,122.50,7.0,2,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,5,0.4,5,17.88,21.40
764,0.62,808.5,367.5,220.50,3.5,2,0.4,5,16.54,16.88
765,0.62,808.5,367.5,220.50,3.5,3,0.4,5,16.44,17.11
766,0.62,808.5,367.5,220.50,3.5,4,0.4,5,16.48,16.61


In [3]:
df.isnull().sum()

X1    0
X2    0
X3    0
X4    0
X5    0
X6    0
X7    0
X8    0
Y1    0
Y2    0
dtype: int64

### We already know from https://archive.ics.uci.edu/dataset/242/energy+efficiency that there are no missing values in the dataset, and the check above confirms it.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      768 non-null    float64
 1   X2      768 non-null    float64
 2   X3      768 non-null    float64
 3   X4      768 non-null    float64
 4   X5      768 non-null    float64
 5   X6      768 non-null    int64  
 6   X7      768 non-null    float64
 7   X8      768 non-null    int64  
 8   Y1      768 non-null    float64
 9   Y2      768 non-null    float64
dtypes: float64(8), int64(2)
memory usage: 60.1 KB


In [5]:
df.describe()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,0.764167,671.708333,318.5,176.604167,5.25,3.5,0.234375,2.8125,22.307195,24.58776
std,0.105777,88.086116,43.626481,45.16595,1.75114,1.118763,0.133221,1.55096,10.090204,9.513306
min,0.62,514.5,245.0,110.25,3.5,2.0,0.0,0.0,6.01,10.9
25%,0.6825,606.375,294.0,140.875,3.5,2.75,0.1,1.75,12.9925,15.62
50%,0.75,673.75,318.5,183.75,5.25,3.5,0.25,3.0,18.95,22.08
75%,0.83,741.125,343.0,220.5,7.0,4.25,0.4,4.0,31.6675,33.1325
max,0.98,808.5,416.5,220.5,7.0,5.0,0.4,5.0,43.1,48.03


#### This dataset has 8 input variables and two output variables.
#### The output variables (Y1 & Y2) are of type float.
#### Six input variables (X1, X2, X3, X4, X5, X7) are of type float and the remaining two (X6, X8) are integers.

# Exploratory Data Analysis

In [6]:
plx.imshow(df.corr(),height=750,width=750,text_auto=True)

# Insights:

## From the above correlation heatmap we can see that:
### 1. Y1 and Y2 are highly positively correlated with each other.
### 2. X5 (Overall Height) is highly positively correlated with both output variables (Y1 & Y2). In other words, taller buildings in this dataset tend to have higher heating and cooling loads.
### 3. X4 (Roof Area) is highly negatively correlated with both output variables (Y1 & Y2). In other words, buildings with a larger roof area tend to have lower heating and cooling loads.
### 4. X1 and X3 (Relative Compactness and Wall Area) have a moderate correlation with the output variables (Y1 & Y2).
### 5. X2 (Surface Area) has a moderate negative correlation with the output variables (Y1 & Y2).
### 6. X3 (Wall Area) is correlated mainly with the output variables (Y1 & Y2), not with the other 7 variables.
### 7. X1, X2, X4 and X5 (Relative Compactness, Surface Area, Roof Area and Overall Height) are highly correlated with each other.
### 8. X7 (Glazing Area) has a correlation of around 0.2 with both output variables (Y1 & Y2) and with X8, but it is not important for the other 6 variables.
### 9. X6 (Orientation) is not correlated with any other variable, including the output variables (Y1 & Y2). What is orientation, and why is it unimportant here? Orientation is how the building is positioned. An explanation found via Google: "Orientation is how a building is positioned in relation to the sun's paths in different seasons, as well as to prevailing wind patterns. In passive design, it is also about how living and sleeping areas are designed and positioned, either to take advantage of the sun and wind, or be protected from their effects". In this dataset it is encoded as the numbers 2, 3, 4 and 5. These codes have a meaning, but we do not know exactly what each one stands for. From the correlations, Y1 and Y2 show no linear dependence on the orientation of the building, so we treat it as unimportant.
### 10. X8 is not important for the output variables (Y1 & Y2), but it has a correlation of 0.2 with X7.
### 11. X1 & X2 are highly negatively correlated with each other.
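
### The near-zero correlations reported for X6 deserve a closer look. The following is a minimal sketch on a made-up two-factor grid (the columns `a`, `b` and `y` are hypothetical and not taken from ENB2012). It illustrates two points: factors that are fully crossed in a balanced design have zero Pearson correlation with each other by construction, and a zero Pearson correlation only rules out a linear relationship.

```python
import itertools
import pandas as pd

# Hypothetical balanced design: every level of `a` is paired with every level of `b`.
levels_a = [2, 3, 4, 5]
levels_b = [0.0, 0.1, 0.25, 0.4]
grid = pd.DataFrame(list(itertools.product(levels_a, levels_b)), columns=["a", "b"])

# Fully crossed factors are uncorrelated by construction.
corr_ab = grid["a"].corr(grid["b"])

# A response that depends on `a` only through a symmetric, non-linear term.
grid["y"] = (grid["a"] - 3.5) ** 2
corr_ay = grid["a"].corr(grid["y"])

print(f"corr(a, b) = {corr_ab:.3f}")
print(f"corr(a, y) = {corr_ay:.3f}")
```

### In this toy grid `y` is fully determined by `a`, yet their correlation is zero. A group-wise mean or a tree-based feature importance may therefore be a safer check than the heatmap alone before a column is dropped.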

In [7]:
df.corr()['X6']

X1    4.678592e-17
X2   -3.459372e-17
X3   -2.429499e-17
X4   -5.830058e-17
X5    4.492205e-17
X6    1.000000e+00
X7   -9.406007e-16
X8   -2.549352e-16
Y1   -2.586763e-03
Y2    1.428960e-02
Name: X6, dtype: float64

In [8]:
df['X6'].value_counts()

2    192
3    192
4    192
5    192
Name: X6, dtype: int64

In [9]:
df['X8'].value_counts()

1    144
2    144
3    144
4    144
5    144
0     48
Name: X8, dtype: int64

In [10]:
plx.box(x = df['X8'],y=df['Y1'],color=df['X8'])

In [11]:
plx.box(x = df['X8'],y=df['Y2'],color=df['X8'])

## From the above two graphs, we can see that level 0 forms one group and the remaining levels form another.
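
### The visual grouping can also be checked numerically. This is a small sketch on made-up values (the `toy` frame is hypothetical and not read from ENB2012_data.xlsx) that compares the median response per X8 level with the median after pooling all non-zero levels.

```python
import pandas as pd

# Made-up values for illustration only; not taken from ENB2012_data.xlsx.
toy = pd.DataFrame({
    "X8": [0, 0, 1, 1, 2, 2, 3, 3],
    "Y1": [10.0, 12.0, 20.0, 22.0, 21.0, 23.0, 19.0, 21.0],
})

# Median response for each individual X8 level.
medians = toy.groupby("X8")["Y1"].median()

# Median response for level 0 versus all other levels pooled (coded 0 and 1).
pooled = toy.groupby((toy["X8"] > 0).astype(int))["Y1"].median()

print(medians)
print(pooled)
```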

# Dimensionality Reduction

In [12]:
plx.imshow(df.corr(),height=750,width=750,text_auto=True)

## 1. From the above heatmap, we know that X6 (Orientation) is not correlated with any of the variables, including the output variables (Y1 & Y2). So, we can remove X6 (Orientation) from the dataset.

In [13]:
df.drop(['X6'],axis=1,inplace=True)

In [14]:
df

Unnamed: 0,X1,X2,X3,X4,X5,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
4,0.90,563.5,318.5,122.50,7.0,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,0.4,5,17.88,21.40
764,0.62,808.5,367.5,220.50,3.5,0.4,5,16.54,16.88
765,0.62,808.5,367.5,220.50,3.5,0.4,5,16.44,17.11
766,0.62,808.5,367.5,220.50,3.5,0.4,5,16.48,16.61


## 2. From the above heatmap, we know that X1 and X2 are highly negatively correlated with each other. We remove whichever of the two has the weaker correlation (in absolute value) with the output variables.
## X1 has a correlation of around 0.63 with the output variables Y1 & Y2.
## X2 has a correlation of around -0.67 with the output variables Y1 & Y2.
## So here we remove X1 (Relative Compactness).
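
### The rule used here (of two strongly correlated features, drop the one with the weaker absolute correlation to the targets) can be sketched as a small helper. The function name `weaker_of_correlated_pairs`, the 0.9 threshold and the synthetic columns are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def weaker_of_correlated_pairs(df, targets, threshold=0.9):
    """For each feature pair with |corr| >= threshold, return the feature
    whose mean absolute correlation with the targets is lower."""
    corr = df.corr()
    features = [c for c in df.columns if c not in targets]
    to_drop = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if abs(corr.loc[a, b]) >= threshold:
                strength_a = corr.loc[a, targets].abs().mean()
                strength_b = corr.loc[b, targets].abs().mean()
                to_drop.add(a if strength_a < strength_b else b)
    return sorted(to_drop)

# Synthetic data for illustration only (not ENB2012).
rng = np.random.default_rng(0)
f1 = rng.normal(size=2000)
f2 = -f1 + rng.normal(scale=0.3, size=2000)   # strongly negatively correlated with f1
f3 = rng.normal(size=2000)                    # unrelated to f1 and f2
y = 2.0 * f2 + f3
demo = pd.DataFrame({"f1": f1, "f2": f2, "f3": f3, "y": y})

dropped = weaker_of_correlated_pairs(demo, targets=["y"])
print(dropped)
```

### On the notebook's data the equivalent call would pass `targets=['Y1', 'Y2']`. Which columns it returns depends on the chosen threshold.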

In [15]:
df.drop(['X1'],axis=1,inplace=True)

In [16]:
df

Unnamed: 0,X2,X3,X4,X5,X7,X8,Y1,Y2
0,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
1,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
2,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
3,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
4,563.5,318.5,122.50,7.0,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...
763,784.0,343.0,220.50,3.5,0.4,5,17.88,21.40
764,808.5,367.5,220.50,3.5,0.4,5,16.54,16.88
765,808.5,367.5,220.50,3.5,0.4,5,16.44,17.11
766,808.5,367.5,220.50,3.5,0.4,5,16.48,16.61


## We can recode X8: 0 stays 0 and all other values become 1.

In [17]:
df.loc[(df['X8']>0), 'X8'] = 1
df

Unnamed: 0,X2,X3,X4,X5,X7,X8,Y1,Y2
0,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
1,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
2,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
3,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
4,563.5,318.5,122.50,7.0,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...
763,784.0,343.0,220.50,3.5,0.4,1,17.88,21.40
764,808.5,367.5,220.50,3.5,0.4,1,16.54,16.88
765,808.5,367.5,220.50,3.5,0.4,1,16.44,17.11
766,808.5,367.5,220.50,3.5,0.4,1,16.48,16.61


## Now the values of X8 have changed.

In [18]:
plx.imshow(df.corr(),height=750,width=750,text_auto=True)

# Insight
### 1. After recoding X8, its correlation with X7 has increased, and so has its correlation with the output variables Y1 & Y2.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X2      768 non-null    float64
 1   X3      768 non-null    float64
 2   X4      768 non-null    float64
 3   X5      768 non-null    float64
 4   X7      768 non-null    float64
 5   X8      768 non-null    int64  
 6   Y1      768 non-null    float64
 7   Y2      768 non-null    float64
dtypes: float64(7), int64(1)
memory usage: 48.1 KB


In [20]:
plx.box(x = df['X8'],y=df['Y1'],color=df['X8'])

In [21]:
plx.box(x = df['X8'],y=df['Y2'],color=df['X8'])

## Now the dataset is ready. We can move forward to model development.

# Model Development 

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import r2_score
from tqdm import tqdm

In [23]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df.drop(['X2','X3','X4','X5','X7','X8'],axis=1)
lr_trn_score,rfr_trn_score,sgd_trn_score,en_trn_score,abr_trn_score,gbr_trn_score,svr_trn_score,xgb_trn_score,cbr_trn_score = [],[],[],[],[],[],[],[],[]
lr_test_score,rfr_test_score,sgd_test_score,en_test_score,abr_test_score,gbr_test_score,svr_test_score,xgb_test_score,cbr_test_score = [],[],[],[],[],[],[],[],[]
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
    
    lr = LinearRegression().fit(x_train, y_train)
    pred = lr.predict(x_test)
    pred_trn = lr.predict(x_train)
    lr_test_score.append(r2_score(y_test, pred))
    lr_trn_score.append(r2_score(y_train, pred_trn))
    
    sgd = MultiOutputRegressor(SGDRegressor()).fit(x_train,y_train)
    pred = sgd.predict(x_test)
    pred_trn = sgd.predict(x_train)
    sgd_test_score.append(r2_score(y_test, pred))
    sgd_trn_score.append(r2_score(y_train, pred_trn))
    
    en = ElasticNet().fit(x_train,y_train)
    pred = en.predict(x_test)
    pred_trn = en.predict(x_train)
    en_test_score.append(r2_score(y_test, pred))
    en_trn_score.append(r2_score(y_train, pred_trn))
    
    abr = MultiOutputRegressor(AdaBoostRegressor()).fit(x_train,y_train)
    pred = abr.predict(x_test)
    pred_trn = abr.predict(x_train)
    abr_test_score.append(r2_score(y_test, pred))
    abr_trn_score.append(r2_score(y_train, pred_trn))
    
    gbr = MultiOutputRegressor(GradientBoostingRegressor()).fit(x_train,y_train)
    pred = gbr.predict(x_test)
    pred_trn = gbr.predict(x_train)
    gbr_test_score.append(r2_score(y_test, pred))
    gbr_trn_score.append(r2_score(y_train, pred_trn))
    
    svr = MultiOutputRegressor(SVR()).fit(x_train,y_train)
    pred = svr.predict(x_test)
    pred_trn = svr.predict(x_train)
    svr_test_score.append(r2_score(y_test, pred))
    svr_trn_score.append(r2_score(y_train, pred_trn))
    
    xgb = MultiOutputRegressor(XGBRegressor()).fit(x_train,y_train)
    pred = xgb.predict(x_test)
    pred_trn = xgb.predict(x_train)
    xgb_test_score.append(r2_score(y_test, pred))
    xgb_trn_score.append(r2_score(y_train, pred_trn))
    
    cbr = MultiOutputRegressor(CatBoostRegressor(verbose=0)).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))
    
    
    rfr = RandomForestRegressor().fit(x_train, y_train)
    pred = rfr.predict(x_test)
    pred_trn = rfr.predict(x_train)
    rfr_test_score.append(r2_score(y_test, pred))
    rfr_trn_score.append(r2_score(y_train, pred_trn))

100%|██████████| 1000/1000 [30:07<00:00,  1.81s/it] 


# 1. Linear Regression

In [24]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = lr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = lr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on Linear Regression')
fig.show()

# 2. SGDRegressor

In [25]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = sgd_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = sgd_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on SGDRegressor')
fig.show()

# 3. ElasticNet

In [26]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = en_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = en_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on ElasticNet Regression')
fig.show()

# 4. AdaBoostRegressor

In [27]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = abr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = abr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on AdaBoostRegressor')
fig.show()

# 5. GradientBoostingRegressor

In [28]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = gbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = gbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on GradientBoostingRegressor')
fig.show()

# 6. SVR

In [29]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = svr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = svr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on Support Vector Regressor')
fig.show()

# 7. XGBRegressor

In [30]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = xgb_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = xgb_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on XGBRegressor')
fig.show()

# 8. CatBoostRegressor

In [31]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()

# 9. RandomForestRegressor

In [32]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = rfr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = rfr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on RandomForestRegressor')
fig.show()

## From the above visualizations, we can see that the boosting algorithms predict better than the other algorithms. Both the train and test scores (r2_score) are good for the boosting algorithms, at around 0.985.
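
### Reading 1000-point line plots by eye makes the comparison hard to quantify. A minimal sketch of a summary table is given below. The helper `summarize_scores` and the placeholder numbers are assumptions for illustration. In the notebook the dictionaries would hold the score lists filled in by the loop, for example `{'lr': lr_test_score, 'cbr': cbr_test_score}`.

```python
import numpy as np
import pandas as pd

def summarize_scores(test_scores, train_scores):
    """Build a table of mean/std R² per model, sorted by mean test score."""
    rows = []
    for name in test_scores:
        rows.append({
            "model": name,
            "test_mean": np.mean(test_scores[name]),
            "test_std": np.std(test_scores[name]),
            "train_mean": np.mean(train_scores[name]),
        })
    table = pd.DataFrame(rows).set_index("model")
    return table.sort_values("test_mean", ascending=False)

# Placeholder numbers for illustration only; not results from the notebook.
test_scores = {"model_a": [0.90, 0.92, 0.91], "model_b": [0.97, 0.99, 0.98]}
train_scores = {"model_a": [0.91, 0.93, 0.92], "model_b": [0.99, 0.99, 0.99]}

summary = summarize_scores(test_scores, train_scores)
print(summary)
```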

In [33]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df.drop(['X2','X3','X4','X5','X7','X8'],axis=1)
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = MultiOutputRegressor(CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100)).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))

0.9854413978366589 0.9814623383584027


In [34]:
y1_pred,y2_pred = [],[]
for i in range(len(pred)):
    y1_pred.append(pred[i][0])
    y2_pred.append(pred[i][1])
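
### Assuming `pred` is a 2-D NumPy array with one column per output, the loop above can be replaced by column slicing. The array `pred_demo` below is a made-up stand-in for `pred`.

```python
import numpy as np

# Stand-in for the (n_samples, 2) array returned by a multi-output predict();
# the numbers are made up.
pred_demo = np.array([[15.2, 21.0],
                      [20.9, 28.1],
                      [17.8, 21.5]])

y1_pred_demo = pred_demo[:, 0]   # first column: heating-load predictions
y2_pred_demo = pred_demo[:, 1]   # second column: cooling-load predictions
print(y1_pred_demo, y2_pred_demo)
```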

In [35]:
def visulaize_performance_of_the_model(pred, y_test, modelname):
    # Plotting both line & scatter plot in same graph of predicted values to check the performance of the model in visualization.
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=np.arange(0,50), y=np.arange(0,50),
                             mode='lines',
                             name='perfectline'))
    fig.add_trace(go.Scatter(x=pred, y=y_test,
                             mode='markers',
                             name='predictions'))
    fig.update_layout(
        title=f"Performance of {modelname} on Test data",
        xaxis_title="Predicted",
        yaxis_title="Actual",
        font=dict(
            family="Courier New, monospace",
            size=13,
            color="RebeccaPurple"
        )
    )
    fig.show()

In [36]:
visulaize_performance_of_the_model(y1_pred, y_test['Y1'], 'CatBoost regressor')

In [37]:
visulaize_performance_of_the_model(y2_pred, y_test['Y2'], 'CatBoost regressor')

## From the above graphs, we can see that the prediction on Y1 is very good, but the prediction on Y2 is somewhat poorer in comparison. So, we decided to predict Y1 and Y2 individually with two models. Let's find the best algorithm for each of Y1 and Y2.
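
### The gap between the two outputs can also be measured directly. By default `r2_score` averages over the output columns, and `multioutput='raw_values'` returns one score per column instead. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions with two output columns.
y_true = np.array([[10.0, 20.0], [20.0, 30.0], [30.0, 40.0], [40.0, 50.0]])
y_hat = np.array([[10.0, 24.0], [20.0, 27.0], [30.0, 43.0], [40.0, 46.0]])

# One R² per output column instead of the default uniform average.
per_output = r2_score(y_true, y_hat, multioutput="raw_values")
averaged = r2_score(y_true, y_hat)
print(per_output, averaged)
```

### In this toy case the first column is predicted perfectly and the second is not, which the averaged score alone would hide.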

# 1. Prediction on Y1(Heating Load)

In [38]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df['Y1']
lr_trn_score,rfr_trn_score,abr_trn_score,gbr_trn_score,xgb_trn_score,cbr_trn_score = [],[],[],[],[],[]
lr_test_score,rfr_test_score,abr_test_score,gbr_test_score,xgb_test_score,cbr_test_score = [],[],[],[],[],[]
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
    
    lr = LinearRegression().fit(x_train, y_train)
    pred = lr.predict(x_test)
    pred_trn = lr.predict(x_train)
    lr_test_score.append(r2_score(y_test, pred))
    lr_trn_score.append(r2_score(y_train, pred_trn))
    
    abr = AdaBoostRegressor().fit(x_train,y_train)
    pred = abr.predict(x_test)
    pred_trn = abr.predict(x_train)
    abr_test_score.append(r2_score(y_test, pred))
    abr_trn_score.append(r2_score(y_train, pred_trn))
    
    gbr = GradientBoostingRegressor().fit(x_train,y_train)
    pred = gbr.predict(x_test)
    pred_trn = gbr.predict(x_train)
    gbr_test_score.append(r2_score(y_test, pred))
    gbr_trn_score.append(r2_score(y_train, pred_trn))
     
    xgb = XGBRegressor().fit(x_train,y_train)
    pred = xgb.predict(x_test)
    pred_trn = xgb.predict(x_train)
    xgb_test_score.append(r2_score(y_test, pred))
    xgb_trn_score.append(r2_score(y_train, pred_trn))
    
    cbr = CatBoostRegressor(verbose=0).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))
    
    
    rfr = RandomForestRegressor().fit(x_train, y_train)
    pred = rfr.predict(x_test)
    pred_trn = rfr.predict(x_train)
    rfr_test_score.append(r2_score(y_test, pred))
    rfr_trn_score.append(r2_score(y_train, pred_trn))

100%|██████████| 1000/1000 [14:46<00:00,  1.13it/s] 


## 1. Linear Regression

In [39]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = lr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = lr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on Linear Regression')
fig.show()

## 2. AdaBoostRegressor

In [40]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = abr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = abr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on AdaBoostRegressor')
fig.show()

## 3. GradientBoostingRegressor

In [41]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = gbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = gbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on GradientBoostingRegressor')
fig.show()

## 4. XGBRegressor

In [42]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = xgb_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = xgb_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on XGBRegressor')
fig.show()

## 5. CatBoostRegressor

In [43]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()

## 6. RandomForestRegressor

In [44]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = rfr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = rfr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on RandomForestRegressor')
fig.show()

# 2. Prediction on Y2(Cooling Load)

In [45]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df['Y2']
lr_trn_score,rfr_trn_score,abr_trn_score,gbr_trn_score,xgb_trn_score,cbr_trn_score = [],[],[],[],[],[]
lr_test_score,rfr_test_score,abr_test_score,gbr_test_score,xgb_test_score,cbr_test_score = [],[],[],[],[],[]
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
    
    lr = LinearRegression().fit(x_train, y_train)
    pred = lr.predict(x_test)
    pred_trn = lr.predict(x_train)
    lr_test_score.append(r2_score(y_test, pred))
    lr_trn_score.append(r2_score(y_train, pred_trn))
    
    abr = AdaBoostRegressor().fit(x_train,y_train)
    pred = abr.predict(x_test)
    pred_trn = abr.predict(x_train)
    abr_test_score.append(r2_score(y_test, pred))
    abr_trn_score.append(r2_score(y_train, pred_trn))
    
    gbr = GradientBoostingRegressor().fit(x_train,y_train)
    pred = gbr.predict(x_test)
    pred_trn = gbr.predict(x_train)
    gbr_test_score.append(r2_score(y_test, pred))
    gbr_trn_score.append(r2_score(y_train, pred_trn))
     
    xgb = XGBRegressor().fit(x_train,y_train)
    pred = xgb.predict(x_test)
    pred_trn = xgb.predict(x_train)
    xgb_test_score.append(r2_score(y_test, pred))
    xgb_trn_score.append(r2_score(y_train, pred_trn))
    
    cbr = CatBoostRegressor(verbose=0).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))
    
    
    rfr = RandomForestRegressor().fit(x_train, y_train)
    pred = rfr.predict(x_test)
    pred_trn = rfr.predict(x_train)
    rfr_test_score.append(r2_score(y_test, pred))
    rfr_trn_score.append(r2_score(y_train, pred_trn))

100%|██████████| 1000/1000 [14:05<00:00,  1.18it/s]


## 1. Linear Regression

In [46]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = lr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = lr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on Linear Regression')
fig.show()

## 2. AdaBoostRegressor

In [47]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = abr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = abr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on AdaBoostRegressor')
fig.show()

## 3. GradientBoostingRegressor

In [48]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = gbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = gbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on GradientBoostingRegressor')
fig.show()

## 4. XGBRegressor

In [49]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = xgb_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = xgb_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on XGBRegressor')
fig.show()

## 5. CatBoostRegressor

In [50]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()

## 6. RandomForestRegressor

In [51]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = rfr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = rfr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on RandomForestRegressor')
fig.show()

In [52]:
X = df.drop(['Y2'],axis=1)
Y = df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))

0.9944150509001013 0.9829129866383302


In [53]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

In [54]:
temp_df = df.loc[df['Y2'] > 25]
temp_df

Unnamed: 0,X2,X3,X4,X5,X7,X8,Y1,Y2
4,563.5,318.5,122.5,7.0,0.0,0,20.84,28.28
5,563.5,318.5,122.5,7.0,0.0,0,21.46,25.38
6,563.5,318.5,122.5,7.0,0.0,0,20.71,25.16
7,563.5,318.5,122.5,7.0,0.0,0,19.68,29.60
8,588.0,294.0,147.0,7.0,0.0,0,19.50,27.30
...,...,...,...,...,...,...,...,...
739,637.0,343.0,147.0,7.0,0.4,1,40.79,44.87
740,661.5,416.5,122.5,7.0,0.4,1,38.82,39.37
741,661.5,416.5,122.5,7.0,0.4,1,39.72,39.80
742,661.5,416.5,122.5,7.0,0.4,1,39.31,37.79


In [55]:
temp_df['X8'].value_counts()

1    354
0     14
Name: X8, dtype: int64

In [56]:
temp_df.loc[temp_df['X8'] == 0]

Unnamed: 0,X2,X3,X4,X5,X7,X8,Y1,Y2
4,563.5,318.5,122.5,7.0,0.0,0,20.84,28.28
5,563.5,318.5,122.5,7.0,0.0,0,21.46,25.38
6,563.5,318.5,122.5,7.0,0.0,0,20.71,25.16
7,563.5,318.5,122.5,7.0,0.0,0,19.68,29.6
8,588.0,294.0,147.0,7.0,0.0,0,19.5,27.3
11,588.0,294.0,147.0,7.0,0.0,0,18.31,27.87
16,637.0,343.0,147.0,7.0,0.0,0,28.52,37.73
17,637.0,343.0,147.0,7.0,0.0,0,29.9,31.27
18,637.0,343.0,147.0,7.0,0.0,0,29.63,30.93
19,637.0,343.0,147.0,7.0,0.0,0,28.75,39.44


## Here, we revert X8 to its original values, because the prediction of Y2 (Cooling Load) shows a large error for values above 25.

In [57]:
df = pd.read_excel(r'C:\Users\harip\INEURON_PROJECTS\Energy Efficiency\energy+efficiency\ENB2012_data.xlsx')
df

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.90,563.5,318.5,122.50,7.0,2,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,5,0.4,5,17.88,21.40
764,0.62,808.5,367.5,220.50,3.5,2,0.4,5,16.54,16.88
765,0.62,808.5,367.5,220.50,3.5,3,0.4,5,16.44,17.11
766,0.62,808.5,367.5,220.50,3.5,4,0.4,5,16.48,16.61


## We remove X1 and X6, for the same reasons given earlier in this notebook.

In [58]:
df.drop(['X1','X6'],axis=1,inplace=True)
df

Unnamed: 0,X2,X3,X4,X5,X7,X8,Y1,Y2
0,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
1,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
2,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
3,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
4,563.5,318.5,122.50,7.0,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...
763,784.0,343.0,220.50,3.5,0.4,5,17.88,21.40
764,808.5,367.5,220.50,3.5,0.4,5,16.54,16.88
765,808.5,367.5,220.50,3.5,0.4,5,16.44,17.11
766,808.5,367.5,220.50,3.5,0.4,5,16.48,16.61


In [59]:
X = df.drop(['Y1','Y2'],axis=1)
Y = df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))

0.977063447328585 0.9401598996120876


In [60]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

## Scenario 1 (In [59]-[60] above): with the X8 feature and without the Y1 feature.

In [61]:
X = df.drop(['Y2'],axis=1)
Y = df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))

0.9961402649937874 0.9825615016523755


In [62]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

## Scenario 2 (In [61]-[62] above): with the X8 and Y1 features.

In [63]:
temp_df = df.drop(['X8'],axis=1)
temp_df

Unnamed: 0,X2,X3,X4,X5,X7,Y1,Y2
0,514.5,294.0,110.25,7.0,0.0,15.55,21.33
1,514.5,294.0,110.25,7.0,0.0,15.55,21.33
2,514.5,294.0,110.25,7.0,0.0,15.55,21.33
3,514.5,294.0,110.25,7.0,0.0,15.55,21.33
4,563.5,318.5,122.50,7.0,0.0,20.84,28.28
...,...,...,...,...,...,...,...
763,784.0,343.0,220.50,3.5,0.4,17.88,21.40
764,808.5,367.5,220.50,3.5,0.4,16.54,16.88
765,808.5,367.5,220.50,3.5,0.4,16.44,17.11
766,808.5,367.5,220.50,3.5,0.4,16.48,16.61


In [64]:
X = temp_df.drop(['Y2'],axis=1)
Y = temp_df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))

0.9939310855310725 0.9892019517369207


In [65]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

## Scenario 3 (In [63]-[65] above): without X8 and with the Y1 feature.

In [66]:
X = temp_df.drop(['Y1','Y2'],axis=1)
Y = temp_df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))

0.9710728626727535 0.9747745043013276


In [67]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

## Scenario 4 (In [66]-[67] above): without the X8 and Y1 features.

In [68]:
df = pd.read_excel(r'C:\Users\harip\INEURON_PROJECTS\Energy Efficiency\energy+efficiency\ENB2012_data.xlsx')
temp_df = df.drop(['X1','X6'],axis=1)
temp_df

Unnamed: 0,X2,X3,X4,X5,X7,X8,Y1,Y2
0,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
1,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
2,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
3,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
4,563.5,318.5,122.50,7.0,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...
763,784.0,343.0,220.50,3.5,0.4,5,17.88,21.40
764,808.5,367.5,220.50,3.5,0.4,5,16.54,16.88
765,808.5,367.5,220.50,3.5,0.4,5,16.44,17.11
766,808.5,367.5,220.50,3.5,0.4,5,16.48,16.61


In [69]:
temp_df.loc[(temp_df['X8'] > 0), 'X8']=1
temp_df

Unnamed: 0,X2,X3,X4,X5,X7,X8,Y1,Y2
0,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
1,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
2,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
3,514.5,294.0,110.25,7.0,0.0,0,15.55,21.33
4,563.5,318.5,122.50,7.0,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...
763,784.0,343.0,220.50,3.5,0.4,1,17.88,21.40
764,808.5,367.5,220.50,3.5,0.4,1,16.54,16.88
765,808.5,367.5,220.50,3.5,0.4,1,16.44,17.11
766,808.5,367.5,220.50,3.5,0.4,1,16.48,16.61


In [70]:
X = temp_df.drop(['Y2'],axis=1)
Y = temp_df['Y2']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))

0.9957190858593685 0.9721345451651117


In [71]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

## Scenario 5 : With modified (binarized) X8 and with Y1 as a feature

## From the above 5 scenarios, we can pick the 2nd or the 5th. The reason is that only a few points lie away from the center line compared with the other 3 scenarios. With X8 and with Y1 as features, the Y2 predictions are good.

## We can check this for Y1 prediction as well, to find out whether X8 should be included or not.
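
Each scenario above was scored on a different random split, so small differences in R² (for example 0.989 versus 0.975 on the test set) may partly reflect the split rather than the feature set. A minimal sketch of a reproducible split follows. The helper `seeded_split_indices` is hypothetical and uses only the standard library; with scikit-learn, passing `random_state` to `train_test_split` serves the same purpose.

```python
import random

def seeded_split_indices(n_rows, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) for a reproducible shuffle split."""
    indices = list(range(n_rows))
    # A private RNG keeps the global random state untouched.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

# The same seed selects the same rows for every scenario being compared.
train_a, test_a = seeded_split_indices(768)
train_b, test_b = seeded_split_indices(768)
print(len(train_a), len(test_a), test_a == test_b)
```

Reusing one set of indices across all scenarios means any change in the score comes from the features, not from which rows happened to land in the test set.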

In [72]:
X = temp_df.drop(['Y1','Y2'],axis=1)
Y = temp_df['Y1']
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(x_train,y_train)
pred = cbr.predict(x_test)
pred_trn = cbr.predict(x_train)
print(r2_score(y_train, pred_trn), r2_score(y_test, pred))

0.9980566490700827 0.9974422531623282


In [73]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

## Without X8, the Y1 prediction is also good. However, X8 is somewhat related to the X7 feature: the two have a correlation of about 0.2. So we can include X8 for Y1 prediction, and we create two separate models for better predictions.

# 1. Prediction of Y1 from the independent variables X2, X3, X4, X5, X7 and X8.

# 2. Prediction of Y2 from the independent variables X2, X3, X4, X5, X7 and X8, plus the dependent variable Y1 as an additional feature.
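
One consequence of the plan above: when Y2 is predicted for a new building, the true Y1 is not available, so the second model has to consume the first model's prediction. Below is a minimal sketch of that chaining. It uses synthetic data and ordinary least squares from NumPy instead of CatBoost so that it stays self-contained; all names and the data-generating coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the features and the two responses (not the real data).
X = rng.normal(size=(200, 3))
y1 = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
y2 = 0.8 * y1 + X[:, 0] + rng.normal(scale=0.1, size=200)

def fit_linear(features, target):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict_linear(coef, features):
    """Apply coefficients from fit_linear to new feature rows."""
    design = np.column_stack([features, np.ones(len(features))])
    return design @ coef

train, test = slice(0, 160), slice(160, 200)

# Model 1: Y1 from the features only.
coef_y1 = fit_linear(X[train], y1[train])
# Model 2: Y2 from the features plus the observed Y1 (known during training).
coef_y2 = fit_linear(np.column_stack([X[train], y1[train]]), y2[train])

# At prediction time the true Y1 is unknown, so model 1's output feeds model 2.
y1_hat = predict_linear(coef_y1, X[test])
y2_hat = predict_linear(coef_y2, np.column_stack([X[test], y1_hat]))

print(y1_hat.shape, y2_hat.shape)
```

Scoring the second model with the true Y1 in the test features may give a somewhat more optimistic number than this chained setting, because errors from the first model are not passed on.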

In [74]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['Y1','Y2'],axis=1)
Y = df['Y1']
#x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(X, Y)
pred = cbr.predict(X)
print(r2_score(Y, pred))
visulaize_performance_of_the_model(pred, Y, 'CatBoost regressor')

0.9999073199670971


In [75]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['Y2'],axis=1)
Y = df['Y2']
#x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(X, Y)
pred = cbr.predict(X)
print(r2_score(Y, pred))
visulaize_performance_of_the_model(pred, Y, 'CatBoost regressor')

0.9992043779559775


In [76]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['X1','X6','Y1','Y2'],axis=1)
Y = df['Y1']
#x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(X, Y)
pred = cbr.predict(X)
print(r2_score(Y, pred))
visulaize_performance_of_the_model(pred, Y, 'CatBoost regressor')

0.9987571086021994


In [77]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['X1','X6','Y2'],axis=1)
Y = df['Y2']
#x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
cbr = CatBoostRegressor(verbose=0, n_estimators=10000,early_stopping_rounds=100).fit(X, Y)
pred = cbr.predict(X)
print(r2_score(Y, pred))
visulaize_performance_of_the_model(pred, Y, 'CatBoost regressor')

0.9962330440522129


## The graphs and scores above show that dimensionality reduction (dropping X1 and X6) leads to a small performance drop: R² falls from 0.9999 to 0.9988 for Y1 and from 0.9992 to 0.9962 for Y2. We can check this by running again on the unmodified dataset.
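
The four scores above are computed on the same rows the model was fitted on, so they mostly measure how well each model reproduces data it has already seen. A small sketch of why in-sample R² can mislead: a hypothetical 1-nearest-neighbour memoriser, written with NumPy on synthetic data, scores perfectly on its own training rows but not on held-out rows. The `r2` function follows the usual definition, 1 - SS_res / SS_tot.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def nearest_neighbour_predict(x_fit, y_fit, x_new):
    """Return the target of the closest fitted point for each new point."""
    distances = np.abs(x_new[:, None] - x_fit[None, :])
    return y_fit[np.argmin(distances, axis=1)]

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)  # synthetic, noisy target

x_train, y_train = x[:240], y[:240]
x_test, y_test = x[240:], y[240:]

train_r2 = r2(y_train, nearest_neighbour_predict(x_train, y_train, x_train))
test_r2 = r2(y_test, nearest_neighbour_predict(x_train, y_train, x_test))
print(f"in-sample R2: {train_r2:.3f}, held-out R2: {test_r2:.3f}")
```

For comparing feature sets, the held-out score is the more informative of the two numbers.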

In [78]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['Y1','Y2'],axis=1)
Y = df.drop(['X1','X2','X3','X4','X5','X6','X7','X8'],axis=1)
lr_trn_score,rfr_trn_score,sgd_trn_score,en_trn_score,abr_trn_score,gbr_trn_score,svr_trn_score,xgb_trn_score,cbr_trn_score = [],[],[],[],[],[],[],[],[]
lr_test_score,rfr_test_score,sgd_test_score,en_test_score,abr_test_score,gbr_test_score,svr_test_score,xgb_test_score,cbr_test_score = [],[],[],[],[],[],[],[],[]
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
    
    lr = LinearRegression().fit(x_train, y_train)
    pred = lr.predict(x_test)
    pred_trn = lr.predict(x_train)
    lr_test_score.append(r2_score(y_test, pred))
    lr_trn_score.append(r2_score(y_train, pred_trn))
    
    sgd = MultiOutputRegressor(SGDRegressor()).fit(x_train,y_train)
    pred = sgd.predict(x_test)
    pred_trn = sgd.predict(x_train)
    sgd_test_score.append(r2_score(y_test, pred))
    sgd_trn_score.append(r2_score(y_train, pred_trn))
    
    en = ElasticNet().fit(x_train,y_train)
    pred = en.predict(x_test)
    pred_trn = en.predict(x_train)
    en_test_score.append(r2_score(y_test, pred))
    en_trn_score.append(r2_score(y_train, pred_trn))
    
    abr = MultiOutputRegressor(AdaBoostRegressor()).fit(x_train,y_train)
    pred = abr.predict(x_test)
    pred_trn = abr.predict(x_train)
    abr_test_score.append(r2_score(y_test, pred))
    abr_trn_score.append(r2_score(y_train, pred_trn))
    
    gbr = MultiOutputRegressor(GradientBoostingRegressor()).fit(x_train,y_train)
    pred = gbr.predict(x_test)
    pred_trn = gbr.predict(x_train)
    gbr_test_score.append(r2_score(y_test, pred))
    gbr_trn_score.append(r2_score(y_train, pred_trn))
    
    svr = MultiOutputRegressor(SVR()).fit(x_train,y_train)
    pred = svr.predict(x_test)
    pred_trn = svr.predict(x_train)
    svr_test_score.append(r2_score(y_test, pred))
    svr_trn_score.append(r2_score(y_train, pred_trn))
    
    xgb = MultiOutputRegressor(XGBRegressor()).fit(x_train,y_train)
    pred = xgb.predict(x_test)
    pred_trn = xgb.predict(x_train)
    xgb_test_score.append(r2_score(y_test, pred))
    xgb_trn_score.append(r2_score(y_train, pred_trn))
    
    cbr = MultiOutputRegressor(CatBoostRegressor(verbose=0)).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))
    
    
    rfr = RandomForestRegressor().fit(x_train, y_train)
    pred = rfr.predict(x_test)
    pred_trn = rfr.predict(x_train)
    rfr_test_score.append(r2_score(y_test, pred))
    rfr_trn_score.append(r2_score(y_train, pred_trn))

100%|██████████| 1000/1000 [34:04<00:00,  2.04s/it] 


## CatBoostRegressor

In [79]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()

## The R² score is around 0.98 when dimensionality reduction and feature engineering are applied. Without dimensionality reduction and feature engineering, the R² score is above 0.995 on both the train and test sets.

In [80]:
print("Train Accuracy :",np.mean(cbr_trn_score)*100)
print("Test Accuracy :",np.mean(cbr_test_score)*100)

Train Accuracy : 99.94105122035963
Test Accuracy : 99.68085255927099
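
The mean alone hides how much the score moves between splits. Below is a minimal sketch of a fuller summary using the standard library. `summarise_scores` is a hypothetical helper, and the example scores are invented rather than taken from the run above.

```python
import statistics

def summarise_scores(scores):
    """Return mean, sample standard deviation, minimum and maximum of R2 scores."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

# Invented scores for illustration; in the notebook this would be cbr_test_score.
example_scores = [0.996, 0.997, 0.995, 0.998, 0.994]
summary = summarise_scores(example_scores)
print({name: round(value, 4) for name, value in summary.items()})
```

Reporting the spread alongside the mean makes it easier to judge whether a gap between two feature sets is larger than the split-to-split variation.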


In [81]:
pred

array([[36.2183 , 39.1086 ],
       [12.1999 , 14.9466 ],
       [36.5728 , 37.2743 ],
       [14.6368 , 17.0479 ],
       [24.2178 , 25.9339 ],
       [32.2173 , 33.1127 ],
       [36.1482 , 36.2495 ],
       [25.3945 , 26.8092 ],
       [14.2866 , 15.0643 ],
       [29.2729 , 30.569  ],
       [14.8952 , 15.576  ],
       [12.9644 , 15.6852 ],
       [26.2406 , 28.0722 ],
       [42.0551 , 42.334  ],
       [40.4267 , 39.8254 ],
       [36.6571 , 36.9522 ],
       [28.5724 , 31.5279 ],
       [14.4487 , 16.7673 ],
       [14.4393 , 17.0646 ],
       [23.9168 , 25.5309 ],
       [32.2052 , 35.2163 ],
       [12.3751 , 15.3781 ],
       [39.3341 , 43.0809 ],
       [29.4039 , 29.5464 ],
       [14.2608 , 17.0458 ],
       [29.0632 , 31.4638 ],
       [12.8261 , 15.9376 ],
       [36.468  , 36.9184 ],
       [12.9689 , 15.8419 ],
       [26.6294 , 29.3116 ],
       [10.3552 , 13.6203 ],
       [11.5032 , 14.1564 ],
       [29.1492 , 31.2337 ],
       [16.7129 , 20.1717 ],
       [32.596

In [82]:
y_test

Unnamed: 0,Y1,Y2
112,35.65,41.07
414,12.10,15.57
256,37.03,34.99
561,14.70,17.00
194,24.04,26.18
...,...,...
199,29.79,29.92
466,12.67,15.83
148,28.07,34.14
393,29.40,32.93


In [83]:
test_values = pd.DataFrame(y_test)
test_values.reset_index(drop=True,inplace=True)
test_values

Unnamed: 0,Y1,Y2
0,35.65,41.07
1,12.10,15.57
2,37.03,34.99
3,14.70,17.00
4,24.04,26.18
...,...,...
149,29.79,29.92
150,12.67,15.83
151,28.07,34.14
152,29.40,32.93


In [84]:
result = pd.DataFrame(pred, columns = ['Predicted Y1', 'Predicted Y2'])
result

Unnamed: 0,Predicted Y1,Predicted Y2
0,36.2183,39.1086
1,12.1999,14.9466
2,36.5728,37.2743
3,14.6368,17.0479
4,24.2178,25.9339
...,...,...
149,28.6452,33.3159
150,12.6679,15.6556
151,29.2666,30.0071
152,29.6490,28.7731


In [85]:
final_y1 = pd.merge(test_values['Y1'], result['Predicted Y1'], left_index=True,right_index=True)
final_y1

Unnamed: 0,Y1,Predicted Y1
0,35.65,36.2183
1,12.10,12.1999
2,37.03,36.5728
3,14.70,14.6368
4,24.04,24.2178
...,...,...
149,29.79,28.6452
150,12.67,12.6679
151,28.07,29.2666
152,29.40,29.6490


In [86]:
final_y2 = pd.merge(test_values['Y2'], result['Predicted Y2'], left_index=True,right_index=True)
final_y2

Unnamed: 0,Y2,Predicted Y2
0,41.07,39.1086
1,15.57,14.9466
2,34.99,37.2743
3,17.00,17.0479
4,26.18,25.9339
...,...,...
149,29.92,33.3159
150,15.83,15.6556
151,34.14,30.0071
152,32.93,28.7731


In [87]:
visulaize_performance_of_the_model(final_y2['Predicted Y2'], final_y2['Y2'], 'CatBoost regressor')

In [88]:
df = pd.DataFrame(pd.read_excel('C:\\Users\\harip\\INEURON_PROJECTS\\Energy Efficiency\\energy+efficiency\\ENB2012_data.xlsx'))
X = df.drop(['Y2'],axis=1)
Y = df['Y2']
cbr_trn_score = []
cbr_test_score = []
for i in tqdm(range(1000)):
    x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
        
    cbr = CatBoostRegressor(verbose=0).fit(x_train,y_train)
    pred = cbr.predict(x_test)
    pred_trn = cbr.predict(x_train)
    cbr_test_score.append(r2_score(y_test, pred))
    cbr_trn_score.append(r2_score(y_train, pred_trn))   

100%|██████████| 1000/1000 [15:17<00:00,  1.09it/s]


In [89]:
fig = make_subplots(rows = 2, cols = 1)
fig.append_trace(go.Scatter(y = cbr_test_score, name = 'Test Score'), row=1, col=1)
fig.append_trace(go.Scatter(y = cbr_trn_score, name = 'Train Score'), row=2, col=1)
fig.update_layout(title = 'Train vs Test Score on CatBoostRegressor')
fig.show()

In [90]:
visulaize_performance_of_the_model(pred, y_test, 'CatBoost regressor')

## In our project development, we are going to follow the rules below:
### 1. Prediction of Y1 from the independent variables X1, X2, X3, X4, X5, X6, X7 and X8.
### 2. Prediction of Y2 from the independent variables X1, X2, X3, X4, X5, X6, X7 and X8, plus the dependent variable Y1 as an additional feature. The reason for adding Y1 for Y2 prediction is