### **Energy-Consumption-Prediction**
Using Regression Analysis

### Project Description
The aim is to build a model that predicts the energy consumption of buildings.

### Features: 

**OSEBuildingID**: Unique Building ID

**BuildingType**: Type of Building

**PrimaryPropertyType**: Type of the Primary Property

**PropertyName**: Name of the Building

**Latitude**: Site Latitude

**Longitude**: Site Longitude

**YearBuilt**: Year the building was built

**NumberofBuildings**: The Number of buildings

**NumberofFloors**: The Number of Floors

**PropertyGFATotal**: Total gross floor area (GFA) of the property

**PropertyGFAParking**: Gross floor area of the parking

**PropertyGFABuilding(s)**: Gross floor area of the buildings

**LargestPropertyUseType**: Type of use of the largest property

**ENERGYSTARScore**: ENERGY STAR score of the property

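**SiteEnergyUse(kBtu)**: Amount of energy recorded on site. This is the outcome variable predicted in the modelling steps.

**SiteEnergyUseWN(kBtu)**: Weather-normalized amount of energy recorded on site. It is highly correlated with SiteEnergyUse(kBtu) and is dropped before modelling.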

**Electricity(kWh)**: Amount of Electricity Used

### Data Exploration: 
   Looking at categorical and continuous feature summaries and making inferences about the data.
### Data Cleaning:
   Imputing missing values in the data and checking for outliers
### Feature Engineering:
   Modifying existing variables and creating new ones for analysis
### Model Building:
   Making predictive models on the data
### Metrics:
   Check the effectiveness of our predictive models

**Importing the libraries**

We import the essential libraries used to load and visualize the dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

**Importing the dataset**

We import the dataset with pandas.

In [None]:
# reading the dataset
energy = pd.read_csv("batiment.csv", sep=";")

# making copies of our dataset
energy = energy.copy()

# displaying the first five lines of our dataset
energy.head()

Now that the dataset is successfully imported, let's describe it.

In [None]:
# checking the columns of the dataset
print(energy.columns)

In [None]:
# Checking the shape of the dataset
print(energy.shape)

The dataset contains 3376 rows and 17 columns, as shown above. Next, we look at summary statistics for the numeric columns.

In [None]:
# describing briefly the dataset
energy.describe()

**DATA VISUALIZATION**

Visualization is an important part of understanding the dataset, so we visualize some of the features.

**- Univariate Data Analysis**

Visualizing the distribution of SiteEnergyUse (the outcome)

In [None]:
# displot is a figure-level function: it creates its own figure,
# so the size is set with height and aspect rather than plt.figure
g = sns.displot(energy['SiteEnergyUse(kBtu)'], bins=10, kde=True, rug=True, color='green', height=5, aspect=3)

# set title of the plot
g.ax.set_title('Distribution of SiteEnergyUse', color='green', pad=10.0)

**Conclusion:** We notice that SiteEnergyUse(kBtu) is highly asymmetric.
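
One way to put a number on this asymmetry is the sample skewness. The sketch below is only an illustration: it uses synthetic log-normal values as a stand-in for `SiteEnergyUse(kBtu)` (the real column is not reproduced here), and the variable names are hypothetical. It also shows why taking the log of such a variable tends to make it more symmetric.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for energy['SiteEnergyUse(kBtu)']: log-normal values are
# strictly positive with a long right tail (an assumption, not the real data).
rng = np.random.default_rng(0)
site_energy = pd.Series(rng.lognormal(mean=14.0, sigma=1.2, size=3000))

# Skewness near 0 means roughly symmetric; a large positive value means a long right tail.
skew_raw = site_energy.skew()
skew_log = np.log(site_energy).skew()

print(f"skewness (raw): {skew_raw:.2f}")
print(f"skewness (log): {skew_log:.2f}")
```

On the real data, the same two calls can be applied directly to `energy['SiteEnergyUse(kBtu)']`, provided the zeros have been handled first, since the log is undefined at zero.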

Visualizing SiteEnergyUse with boxplot

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#set title of the plot
plt.title('Visualization of SiteEnergyUse With Boxplot', color='green',
          loc='right', pad=10.0)

# Plotting SiteEnergyUse(kBtu)
sns.boxplot(x =energy['SiteEnergyUse(kBtu)'])
sns.stripplot(x = energy['SiteEnergyUse(kBtu)'], color = 'darkgreen')

Visualizing Electricity with boxplot

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

# set title of the plot
plt.title('Visualization of Electricity With Boxplot', color='blue',
          loc='right', pad=10.0)

# Plotting Electricity(kWh)
sns.boxplot(x =energy['Electricity(kWh)'])
sns.stripplot(x = energy['Electricity(kWh)'], color = 'darkblue')

Visualizing ENERGYSTARScore with boxplot

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#set title of the plot
plt.title('Visualization of ENERGYSTARScore With Boxplot', color='blue',
          loc='right', pad=10.0)

# Plotting ENERGYSTARScore
sns.boxplot(x =energy['ENERGYSTARScore'], color='gray')

Visualizing the type of Building

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Plotting BuildingType
ax = sns.countplot(x=energy['BuildingType'])

# set title of the plot
ax.set_title("Visualization of the Type of Buildings", color='Brown', fontsize = 15)

Visualizing the primary type of Building

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(45,20))

#Plotting PrimaryPropertyType
ax = sns.countplot(x=energy['PrimaryPropertyType'])

#Set title of the plot
ax.set_title("Visualization of the Primary Property Type", color='brown', fontsize = 30)

**- Bivariate Data Analysis**

Visualizing SiteEnergyUse by BuildingType

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of SiteEnergyUse(kBtu) by BuildingType', color = 'purple',
          loc='right', pad=10.0)

# Plotting BuildingType with SiteEnergyUse
sns.boxplot(x=energy['BuildingType'], y=energy['SiteEnergyUse(kBtu)'], color='orange', width=0.4)

Visualizing the distribution of YearBuilt by PrimaryPropertyType

In [None]:
#Importing joyplot from the joypy library
from joypy import joyplot
from matplotlib import cm

#Plotting the PrimaryPropertyType & YearBuilt
joyplot(energy, by = 'PrimaryPropertyType', column = 'YearBuilt', kind="kde", colormap = cm.autumn_r, fade = True, 
        range_style='own', figsize = (15,15), 
        title = 'Distribution of Year based on PrimaryPropertyType')

# display
plt.show()

Checking the number of Buildings by Type

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of NumberofBuildings by BuildingType', color = 'purple',
          loc='right', pad=10.0)

# PLotting BuildingType with NumberofBuildings
sns.boxplot(x=energy['BuildingType'], y=energy['NumberofBuildings'], color='orange', width=0.2)

Visualizing the behavior of the PropertyGFATotal with ENERGYSTARScore by BuildingType

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of PropertyGFATotal against ENERGYSTARScore by BuildingType', color ='green',
          loc='right', pad=10.0)

# PLotting PropertyGFATotal with ENERGYSTARScore by BuildingType
sns.scatterplot(x=energy['PropertyGFATotal'], y=energy['ENERGYSTARScore'], hue=energy['BuildingType'])

Visualizing ENERGYSTARScore for each BuildingType

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of ENERGYSTARScore by BuildingType', color = 'purple',
          loc='right', pad=10.0)

#PLotting BuildingType with ENERGYSTARScore by BuildingType
sns.scatterplot(x=energy['BuildingType'], y=energy['ENERGYSTARScore'], hue=energy['BuildingType'])

Visualizing the behavior of the YearBuilt with SiteEnergyUse(kBtu) by BuildingType

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of YearBuilt against SiteEnergyUse(kBtu) by BuildingType', color ='blue',
          loc='right', pad=10.0)

#PLotting YearBuilt with SiteEnergyUse(kBtu) by BuildingType
sns.scatterplot(x=energy['YearBuilt'], y=energy['SiteEnergyUse(kBtu)'], hue=energy['BuildingType'])

Visualizing the behavior of the YearBuilt with Electricity(kWh) by BuildingType

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of YearBuilt against Electricity(kWh) by BuildingType', color = 'blue',
          loc='right', pad=10.0)

#PLotting YearBuilt with Electricity(kWh) by BuildingType
sns.scatterplot(x=energy['YearBuilt'], y=energy['Electricity(kWh)'], hue=energy['BuildingType'])

Visualizing the behavior of the Electricity(kWh) with SiteEnergyUse(kBtu)

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of Electricity(kWh) against SiteEnergyUse(kBtu)', color ='darkblue',
          loc='right', pad=10.0)

#PLotting Electricity(kWh) with SiteEnergyUse(kBtu)
sns.scatterplot(x=energy['Electricity(kWh)'], y=energy['SiteEnergyUse(kBtu)'], color='darkblue')

Visualizing the distribution of the ENERGYSTARScore by BuildingType

In [None]:
# Set figure size for the notebook
plt.figure(figsize=(15,5))

# Without transparency
sns.kdeplot(data=energy, x=energy['ENERGYSTARScore'], hue=energy['BuildingType'], fill=True, common_norm=False, alpha=1, warn_singular=False)

#show
plt.show()

Visualizing the behavior of PropertyGFATotal by BuildingType

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualizing the behavior of PropertyGFATotal by BuildingType', color= 'purple',
          loc='right', pad=10.0)

#PLotting BuildingType with PropertyGFATotal
sns.stripplot(x="BuildingType", y="PropertyGFATotal", data=energy)

Visualizing the behavior of the Electricity(kWh) with SiteEnergyUseWN(kBtu) by ENERGYSTARScore

In [None]:
#Setting figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of Electricity(kWh) against SiteEnergyUseWN(kBtu) by ENERGYSTARScore', color = 'red',
          loc='right', pad=10.0)

#PLotting Electricity(kWh) with SiteEnergyUseWN(kBtu) by ENERGYSTARScore
sns.scatterplot(x=energy['Electricity(kWh)'], y=energy['SiteEnergyUseWN(kBtu)'], hue=energy['ENERGYSTARScore'], color='red')

Visualizing the behavior of NumberofBuildings with SiteEnergyUse(kBtu) by YearBuilt

In [None]:
# Set figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of NumberofBuildings against SiteEnergyUse(kBtu) by YearBuilt', color = 'purple',
          loc='right', pad=10.0)

#Plotting NumberofBuildings and SiteEnergyUse(kBtu) by YearBuilt
sns.scatterplot(x=energy['SiteEnergyUse(kBtu)'], y=energy['NumberofBuildings'], 
                hue=energy['YearBuilt'])

Visualizing the behavior of SiteEnergyUseWN(kBtu) with SiteEnergyUse(kBtu) by BuildingType

In [None]:
# Set figure size for the notebook
plt.figure(figsize=(15,5))

#Set title of the plot
plt.title('Visualization of SiteEnergyUseWN(kBtu) against SiteEnergyUse(kBtu) by BuildingType', color ='red',
          loc='right', pad=10.0)

#Plotting SiteEnergyUseWN(kBtu) with SiteEnergyUse(kBtu) by BuildingType
sns.scatterplot(x=energy['SiteEnergyUseWN(kBtu)'], y=energy['SiteEnergyUse(kBtu)'], hue=energy['BuildingType'], color='red', data=energy)

**- Multivariate Analysis**

Broad exploration of the dataset with a pair plot

In [None]:
#Plotting the Dataset
sns.pairplot(energy, hue ='BuildingType')

#Show
plt.show()

**DATA PREPROCESSING**

**- Performing a PCA & printing the correlation matrix**

In [None]:
# making another copy of our dataset
dataset = energy.copy()

Correlation Matrix

In [None]:
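# Correlation between the numeric columns only
# (numeric_only=True is needed on recent pandas versions when text columns are present)
correlate = dataset.corr(numeric_only=True)
correlate.round(2)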

In [None]:
# CORRELATION MATRIX 

sns.set_theme(style="white")

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(correlate, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(correlate, mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

The correlation matrix shows that SiteEnergyUseWN(kBtu) and SiteEnergyUse(kBtu) are highly correlated, so we keep only one of them. The same holds for Electricity(kWh) and SiteEnergyUseWN(kBtu).
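
Reading strongly correlated pairs off a heatmap can be error-prone, so they can also be listed programmatically. The following is a minimal sketch on toy data: the columns `a`, `b`, `c` and the 0.9 cut-off are assumptions chosen for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): 'b' is almost a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=500),
    "c": rng.normal(size=500),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so that each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs whose absolute correlation exceeds a hypothetical 0.9 cut-off.
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]
print(high_pairs)
```

Applied to `correlate`, the same steps would list the column pairs that are candidates for dropping one member.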

**- Identifying the most important variables**

In [None]:
# Removing useless features
newdata = dataset.drop(columns=['Electricity(kWh)', 'Longitude', 'Latitude', 'SiteEnergyUseWN(kBtu)'])

#Printing the shape
print(newdata.shape)

**- Checking columns with null values and outliers**

In [None]:
# Getting unique values
newdata.apply(lambda x: len(x.unique()))

In [None]:
# Checking columns with null values
newdata.columns[newdata.isnull().any()]

In [None]:
# Checking null values
newdata.isnull().sum()

In [None]:
# treating zeros as missing values, then imputing them
# (np.nan replaces np.NaN, which no longer exists in NumPy 2.0; the result of
# fillna is assigned back instead of relying on inplace=True on a column)

newdata['NumberofBuildings'] = newdata['NumberofBuildings'].replace(0, np.nan)
newdata['NumberofBuildings'] = newdata['NumberofBuildings'].fillna(newdata['NumberofBuildings'].median())

newdata['SiteEnergyUse(kBtu)'] = newdata['SiteEnergyUse(kBtu)'].replace(0, np.nan)
newdata['SiteEnergyUse(kBtu)'] = newdata['SiteEnergyUse(kBtu)'].fillna(newdata['SiteEnergyUse(kBtu)'].median())

newdata['LargestPropertyUseType'] = newdata['LargestPropertyUseType'].replace(0, np.nan)
newdata['LargestPropertyUseType'] = newdata['LargestPropertyUseType'].fillna('Unknown')

newdata['ENERGYSTARScore'] = newdata['ENERGYSTARScore'].replace(0, np.nan)
newdata['ENERGYSTARScore'] = newdata['ENERGYSTARScore'].fillna(newdata['ENERGYSTARScore'].median())

newdata.isnull().sum()

**- Management of Outliers**

After the visualization, we notice some outliers in the dataset. We check for them and remove them, because regression is very sensitive to outliers.

First, we take the log of SiteEnergyUse(kBtu) to make modelling more effective.

In [None]:
# Converting SiteEnergyUse(kBtu) to a log scale

newdata['SiteEnergyUse(kBtu)'] = np.log(newdata['SiteEnergyUse(kBtu)'])

In [None]:
# Observe mean and std ... 
newdata['SiteEnergyUse(kBtu)'].describe()
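
Because the target is now on a log scale, anything predicted from it is in log units as well. A minimal sketch of the round trip, using made-up energy values rather than rows from the dataset, could look like this:

```python
import numpy as np

# Hypothetical energy values in kBtu (not taken from the dataset).
energy_kbtu = np.array([1.2e5, 8.5e5, 3.0e6, 4.7e7])

# Forward transform used for modelling (only defined for strictly positive values).
log_energy = np.log(energy_kbtu)

# Inverse transform: brings log-scale values (e.g. predictions) back to kBtu.
recovered = np.exp(log_energy)

print(log_energy.round(2))
print(np.allclose(recovered, energy_kbtu))
```

The same `np.exp` call can be applied to model predictions when results need to be reported in kBtu.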

Calculating the z score to get the outliers

In [None]:
# a row is considered an outlier when its absolute z score exceeds 3
threshold = 3
mean = newdata['SiteEnergyUse(kBtu)'].mean()
std = newdata['SiteEnergyUse(kBtu)'].std()

# computing the z score of every row at once
z = (newdata['SiteEnergyUse(kBtu)'] - mean) / std

# displaying the index of all the rows that contain an outlier
# (abs() flags both unusually high and unusually low values)
outliers_index = newdata[z.abs() > threshold].index
outliers_index
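
The z score of a value x is (x - mean) / std, and a value is usually flagged when the absolute score exceeds the threshold. The sketch below illustrates the rule on synthetic values, assuming 200 regular points plus two extreme ones. None of these numbers come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy log-energy values (assumption): 200 regular points plus two extreme ones.
rng = np.random.default_rng(2)
values = pd.Series(np.concatenate([rng.normal(15.0, 1.0, size=200), [25.0, 4.0]]))

# z = (x - mean) / std, computed for the whole column at once.
z = (values - values.mean()) / values.std()

# Flag both tails; drop .abs() to flag only unusually high values.
flagged = values[z.abs() > 3]
print(flagged)
```

Note that extreme points inflate the standard deviation themselves, so the rule becomes less sensitive as more outliers are present.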

Dropping Outliers of the Dataset

In [None]:
# removing outliers from the dataset
newdata = newdata.drop(outliers_index)

#Printing the Shape of Dataset
newdata.shape

Renaming SiteEnergyUse(kBtu) for Splitting

In [None]:
#Renaming SiteEnergyUse(kBtu) to SiteEnergyUse for splitting
newdata = newdata.rename(columns={"SiteEnergyUse(kBtu)": "SiteEnergyUse"})

**MODEL BUILDING**

The first step before modelling is to turn the categorical values into numerical values using label encoding.

In [None]:
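# label encoding
from sklearn.preprocessing import LabelEncoder

# encoding only the non-numeric columns: applying LabelEncoder to every column
# would also replace the continuous features and the log target by integer ranks
for col in newdata.select_dtypes(exclude='number').columns:
    newdata[col] = LabelEncoder().fit_transform(newdata[col].astype(str))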
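
To see why the encoder is best restricted to text columns, the toy sketch below compares encoding every column with encoding only the non-numeric ones. The frame and its `log_energy` column are hypothetical and only meant to show the effect on a continuous variable.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumption): one text column and one continuous column.
toy = pd.DataFrame({
    "BuildingType": ["Campus", "NonResidential", "Campus", "Multifamily"],
    "log_energy": [13.7, 15.2, 12.9, 16.4],
})

# Encoding every column replaces the continuous values by their rank.
encoded_all = toy.apply(LabelEncoder().fit_transform)

# Encoding only the non-numeric columns leaves the numeric column untouched.
encoded_text = toy.copy()
for col in encoded_text.select_dtypes(exclude="number").columns:
    encoded_text[col] = LabelEncoder().fit_transform(encoded_text[col])

print(encoded_all)
print(encoded_text)
```

In the first result the distances between energy values are lost, which matters when that column is the regression target.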

Splitting the data into dependent and independent variables

In [None]:
# splitting the data into dependent and independent variables

x = newdata.drop('SiteEnergyUse', axis = 1)
y = newdata.SiteEnergyUse

print(x.shape)
print(y.shape)

In [None]:
# making x_train, x_test, y_train, y_test

from sklearn.model_selection import train_test_split

# random_state is fixed (arbitrary value) so that the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

x_train = pd.DataFrame(x_train)
x_train.head()

**- Running Regression Algorithms**

- Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

model_lr = LinearRegression()

#Starting the modelling
start_lr = time.time()
model_lr.fit(x_train, y_train)

# predicting the  test set results
y_pred = model_lr.predict(x_test)

#Ending the modelling
end_lr = time.time()

# finding the rmse,r2, mae and execution time of the model
score_one = round(r2_score(y_test, y_pred), 3)
mae_one = round(mean_absolute_error(y_test, y_pred), 3)
rmse_one = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
ex_time_lr = round(end_lr - start_lr, 3)
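
The same three metrics are computed for every model below, so they can be gathered in one place. The helper `regression_report` is a hypothetical name introduced for this sketch and the example values are made up. Note that RMSE is the square root of what `mean_squared_error` returns.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return R2, MAE and RMSE rounded to 3 decimals (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": round(float(r2_score(y_true, y_pred)), 3),
        "MAE": round(float(mean_absolute_error(y_true, y_pred)), 3),
        "RMSE": round(float(np.sqrt(mse)), 3),  # RMSE is the square root of MSE
    }

# Tiny hand-checkable example (values are made up).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
report = regression_report(y_true, y_pred)
print(report)
```

Here the only error is 2 on the last point, so MAE is 0.5 and both MSE and RMSE equal 1.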

- GradientBoostingRegressor (scikit-learn gradient boosting)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

model_gb = GradientBoostingRegressor()

#Starting the modelling
start_gb = time.time()

model_gb.fit(x_train, y_train)

# predicting the test set results
y_pred = model_gb.predict(x_test)

#Ending the modelling
end_gb = time.time()

# finding the rmse,r2, mae and execution time of the model
score_two = round(r2_score(y_test, y_pred), 3)
mae_two = round(mean_absolute_error(y_test, y_pred), 3)
rmse_two = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
ex_time_gb = round(end_gb - start_gb, 3)

- RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

model_rf = RandomForestRegressor(n_estimators = 100 , n_jobs = -1)

#Starting the modelling
start_rf = time.time()

model_rf.fit(x_train, y_train)

# predicting the  test set results
y_pred = model_rf.predict(x_test)

#Ending the modelling
end_rf = time.time()

# finding the rmse,r2, mae and execution time of the model
score_three = round(r2_score(y_test, y_pred), 3)
mae_three = round(mean_absolute_error(y_test, y_pred), 3)
rmse_three = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
ex_time_rf = round(end_rf - start_rf, 3)

- DecisionTreeRegressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

model_dt = DecisionTreeRegressor()

#Starting the modelling
start_dt = time.time()

model_dt.fit(x_train, y_train)

# predicting the test set results
y_pred = model_dt.predict(x_test)

#Ending the modelling
end_dt = time.time()

# finding the rmse,r2, mae and execution time of the model
score_fourth = round(r2_score(y_test, y_pred), 3)
mae_fourth = round(mean_absolute_error(y_test, y_pred), 3)
rmse_fourth = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
ex_time_dt = round(end_dt - start_dt, 3)

- Lasso Regression

In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

# instantiating the model
model_lasso = Lasso()

#Starting the modelling
start_lasso = time.time()
model_lasso.fit(x_train, y_train)

# predicting the  test set results
y_pred = model_lasso.predict(x_test)

#Ending the modelling
end_lasso = time.time()

# finding the rmse,r2, mae and execution time of the model
score_five = round(r2_score(y_test, y_pred), 3)
mae_five = round(mean_absolute_error(y_test, y_pred), 3)
rmse_five = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
ex_time_lasso = round(end_lasso - start_lasso, 3)

- Cross Validation

Here we evaluate each model with 5-fold cross-validation on the training set.

In [None]:
# Cross-validation
from sklearn.model_selection import cross_val_score

cvs_rf = cross_val_score(estimator=model_rf , X = x_train, y = y_train , cv =5)
cvs_dt = cross_val_score(estimator=model_dt , X = x_train, y = y_train , cv =5)
cvs_gb = cross_val_score(estimator=model_gb , X = x_train, y = y_train , cv =5)
cvs_lr = cross_val_score(estimator=model_lr , X = x_train, y = y_train , cv =5)
cvs_lasso = cross_val_score(estimator=model_lasso , X = x_train, y = y_train , cv =5)
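
The scores above are computed on features that were scaled once, using the whole training set. One common alternative, sketched here on synthetic data under the assumption of a simple linear signal, is to wrap the scaler and the model in a pipeline so that the scaling is re-fitted inside each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption): a linear signal plus small noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# The scaler is re-fitted on the training part of each fold.
pipe = make_pipeline(StandardScaler(), LinearRegression())

# For regressors, cross_val_score reports R^2 by default.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```

For tree-based models the scaling step makes little difference, so this mostly matters for the linear and Lasso models.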

**Storing the results in Dataframe**

In [None]:
models = pd.DataFrame({
    'Model':["Linear Regression", "Gradient Boosting Regressor","Random Forest Regression", "Decision Tree Regressor", "Lasso Regression"],
    "Score":[score_one, score_two, score_three, score_fourth, score_five], 
    "MAE": [mae_one, mae_two, mae_three, mae_fourth, mae_five] ,
    "RMSE": [rmse_one, rmse_two, rmse_three, rmse_fourth, rmse_five] ,
    "EX_TIME": [ex_time_lr, ex_time_gb, ex_time_rf, ex_time_dt, ex_time_lasso] ,
    "CROSS": [round(cvs_lr.mean()*100 , 2), round(cvs_gb.mean()*100 , 2), 
              round(cvs_rf.mean()*100 , 2), round(cvs_dt.mean()*100 , 2), 
              round(cvs_lasso.mean()*100 , 2)] ,
})
models.sort_values(by='Score',ascending=False)

**Model Optimization**

We optimize the RandomForest model with GridSearchCV.

In [None]:
# Importing GridSearch
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

#setting the hyperparameters
param_grid = [
    {'n_estimators': [30, 50, 80, 100], 'max_features': [2, 4, 6, 8, 10, 12]},
]

grid_search = GridSearchCV(model_rf, param_grid, cv=5, verbose=2, 
                           scoring='neg_mean_squared_error', 
                           return_train_score=True)

grid_search.fit(x_train, y_train)

In [None]:
# displaying the best parameters
grid_search.best_params_

In [None]:
# displaying the results
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

In [None]:
# Displaying the best estimator
grid_search.best_estimator_
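
A natural last step is to score the tuned model on the held-out test set. The sketch below shows the idea on synthetic data with a deliberately small grid. The data, the grid values and the variable names are assumptions made so that the example runs quickly on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (assumption) standing in for the scaled features and log target.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Deliberately small grid so the sketch runs quickly.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [20, 40], "max_features": [2, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_tr, y_tr)

# best_estimator_ is refitted on the full training set (refit=True by default).
best_model = search.best_estimator_
test_rmse = float(np.sqrt(mean_squared_error(y_te, best_model.predict(X_te))))
print(search.best_params_, round(test_rmse, 3))
```

In the notebook, the equivalent call would be `grid_search.best_estimator_.predict(x_test)`, compared against `y_test`.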

`Pernel AVOUGNASSOU`