 
# PROJECT: PREDICTING ENERGY EFFICIENCY OF BUILDINGS
# (Using Regression Models)

# Dataset Description

Appliances Energy Prediction Dataset

The dataset used for the remainder of this quiz is the Appliances Energy Prediction data. It records readings at 10-minute intervals over about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions roughly every 3.3 minutes, and the wireless data was then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters). The attribute information is listed below.
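Because the readings are logged every 10 minutes and merged on the date and time column, it can be worth parsing that column and checking the spacing before modelling. The following is a minimal sketch that uses the first five timestamps from the data preview further down; the `sample` frame is an illustrative stand-in for the full file, not part of the original notebook.

```python
import pandas as pd

# First five timestamps from the dataset preview (stand-in for the full CSV).
sample = pd.DataFrame({
    "date": [
        "2016-01-11 17:00:00",
        "2016-01-11 17:10:00",
        "2016-01-11 17:20:00",
        "2016-01-11 17:30:00",
        "2016-01-11 17:40:00",
    ]
})

# Convert the text column to real timestamps.
sample["date"] = pd.to_datetime(sample["date"])

# Differences between consecutive rows; the first row has no predecessor.
steps = sample["date"].diff().dropna()

# True when every consecutive pair of readings is exactly 10 minutes apart.
all_ten_minutes = bool((steps == pd.Timedelta(minutes=10)).all())
print(all_ten_minutes)
```

On the full frame the same check would run on `E_df["date"]` after conversion with `pd.to_datetime`.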

Attribute Information:

Date: time year-month-day hour:minute:second

Appliances: energy use in Wh

lights: energy use of light fixtures in the house in Wh

T1: Temperature in kitchen area, in Celsius

RH_1: Humidity in kitchen area, in %

T2: Temperature in living room area, in Celsius

RH_2: Humidity in living room area, in %

T3: Temperature in laundry room area, in Celsius

RH_3: Humidity in laundry room area, in %

T4: Temperature in office room, in Celsius

RH_4: Humidity in office room, in %

T5: Temperature in bathroom, in Celsius

RH_5: Humidity in bathroom, in %

T6: Temperature outside the building (north side), in Celsius

RH_6: Humidity outside the building (north side), in %

T7: Temperature in ironing room, in Celsius

RH_7: Humidity in ironing room, in %

T8: Temperature in teenager room 2, in Celsius

RH_8: Humidity in teenager room 2, in %

T9: Temperature in parents room, in Celsius

RH_9: Humidity in parents room, in %

T_out: Temperature outside (from Chievres weather station), in Celsius

Pressure (from Chievres weather station): in mm Hg

RH_out: Humidity outside (from Chievres weather station): in %

Wind speed (from Chievres weather station): in m/s

Visibility (from Chievres weather station): in km

Tdewpoint (from Chievres weather station): °C

rv1: Random variable 1, nondimensional

rv2: Random variable 2, nondimensional
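In the preview rows shown further down, `rv1` and `rv2` carry the same values, which suggests the two random columns may be duplicates of each other. A minimal sketch of how that could be checked, using only the five preview rows (the `sample` frame is an illustrative stand-in, and the result says nothing about the remaining rows):

```python
import pandas as pd

# rv1 and rv2 values copied from the first five rows of the data preview.
sample = pd.DataFrame({
    "rv1": [13.275433, 18.606195, 28.642668, 45.410389, 10.084097],
    "rv2": [13.275433, 18.606195, 28.642668, 45.410389, 10.084097],
})

# True when the two columns agree on every row of the sample.
identical = bool((sample["rv1"] == sample["rv2"]).all())
print(identical)
```

The equivalent check on the loaded data would be `(E_df["rv1"] == E_df["rv2"]).all()`.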

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
E_df = pd.read_csv(r'C:\Users\Joshua Ayobami\Desktop\Joshua\DATA SCIENCE\REAL WORLD DATA REPOSITORY\energydata_complete.csv')
E_df

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.890000,47.596667,19.200000,44.790000,19.790000,44.730000,19.000000,...,17.033333,45.5300,6.600000,733.5,92.000000,7.000000,63.000000,5.300000,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.890000,46.693333,19.200000,44.722500,19.790000,44.790000,19.000000,...,17.066667,45.5600,6.483333,733.6,92.000000,6.666667,59.166667,5.200000,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.890000,46.300000,19.200000,44.626667,19.790000,44.933333,18.926667,...,17.000000,45.5000,6.366667,733.7,92.000000,6.333333,55.333333,5.100000,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.890000,46.066667,19.200000,44.590000,19.790000,45.000000,18.890000,...,17.000000,45.4000,6.250000,733.8,92.000000,6.000000,51.500000,5.000000,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.890000,46.333333,19.200000,44.530000,19.790000,45.000000,18.890000,...,17.000000,45.4000,6.133333,733.9,92.000000,5.666667,47.666667,4.900000,10.084097,10.084097
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19730,2016-05-27 17:20:00,100,0,25.566667,46.560000,25.890000,42.025714,27.200000,41.163333,24.700000,...,23.200000,46.7900,22.733333,755.2,55.666667,3.333333,23.666667,13.333333,43.096812,43.096812
19731,2016-05-27 17:30:00,90,0,25.500000,46.500000,25.754000,42.080000,27.133333,41.223333,24.700000,...,23.200000,46.7900,22.600000,755.2,56.000000,3.500000,24.500000,13.300000,49.282940,49.282940
19732,2016-05-27 17:40:00,270,10,25.500000,46.596667,25.628571,42.768571,27.050000,41.690000,24.700000,...,23.200000,46.7900,22.466667,755.2,56.333333,3.666667,25.333333,13.266667,29.199117,29.199117
19733,2016-05-27 17:50:00,420,10,25.500000,46.990000,25.414000,43.036000,26.890000,41.290000,24.700000,...,23.200000,46.8175,22.333333,755.2,56.666667,3.833333,26.166667,13.233333,6.322784,6.322784


In [3]:
E_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

# MODELING

In [4]:
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.linear_model import Ridge

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [5]:
#splitting the data in x and y
x = E_df[['T2']]
y = E_df[['T6']]

In [6]:
%%time
linear_model = LinearRegression()

linear_model.fit(x, y)

Wall time: 48 ms


LinearRegression()

In [8]:
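# r2_score expects (y_true, y_pred), so the model is scored against its own predictions.
# The -35.39 shown below came from r2_score(x, y), which compares the raw T2 values
# with T6 and is not the R^2 of the fitted model.
r2_score1 = r2_score(y, linear_model.predict(x))
round(r2_score1, 2)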

-35.39
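scikit-learn's `r2_score` takes the true values first and the predictions second, so a score for a fitted model compares `y` with the model's predictions rather than with the raw feature. The sketch below illustrates the difference on hypothetical toy data, with a hand-written `r_squared` helper (an assumption for illustration, following the usual 1 - RSS/TSS definition) and a straight-line fit from `np.polyfit`:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - rss / tss

# Hypothetical toy data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(16, 30, size=200)
y = 3.0 * x - 50.0 + rng.normal(0, 2.0, size=200)

# Least-squares straight line and its fitted values.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

r2_model = r_squared(y, y_hat)   # target vs. fitted values: the model's R^2
r2_raw = r_squared(x, y)         # feature vs. target: not a model score
print(round(r2_model, 2), round(r2_raw, 2))
```

Scoring the raw feature against the target can give a large negative number even when the fitted line describes the data well.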

In [9]:
# Drop the columns that are excluded before normalising

E_df1 = E_df.drop(['date', 'lights'], axis=1)
E_df1.head()

Unnamed: 0,Appliances,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,60,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,60,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,50,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,50,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,60,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [10]:
#Firstly, we normalise our dataset to a common scale using the min max scaler

from sklearn.preprocessing import MinMaxScaler

In [11]:
Scaler = MinMaxScaler()
normalised_df = pd.DataFrame(Scaler.fit_transform(E_df1), columns = E_df1.columns)
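With its default feature range of 0 to 1, `MinMaxScaler` rescales each column as (x - min) / (max - min). A minimal sketch of that arithmetic on a hypothetical column of temperatures (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np

# Hypothetical column of temperatures, for illustration only.
col = np.array([17.0, 19.0, 21.0, 25.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
scaled = (col - col.min()) / (col.max() - col.min())
print(scaled)
```

Because the minimum and maximum are taken per column, every feature ends up on the same 0 to 1 scale regardless of its original unit.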

In [12]:
features_df = normalised_df.drop(columns=['Appliances'])
heating_target = normalised_df['Appliances']

In [13]:
features_df.head()

Unnamed: 0,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,0.32735,0.566187,0.225345,0.684038,0.215188,0.746066,0.351351,0.764262,0.175506,0.381691,...,0.223032,0.67729,0.37299,0.097674,0.894737,0.5,0.953846,0.538462,0.265449,0.265449
1,0.32735,0.541326,0.225345,0.68214,0.215188,0.748871,0.351351,0.782437,0.175506,0.381691,...,0.2265,0.678532,0.369239,0.1,0.894737,0.47619,0.894872,0.533937,0.372083,0.372083
2,0.32735,0.530502,0.225345,0.679445,0.215188,0.755569,0.344745,0.778062,0.175506,0.380037,...,0.219563,0.676049,0.365488,0.102326,0.894737,0.452381,0.835897,0.529412,0.572848,0.572848
3,0.32735,0.52408,0.225345,0.678414,0.215188,0.758685,0.341441,0.770949,0.175506,0.380037,...,0.219563,0.671909,0.361736,0.104651,0.894737,0.428571,0.776923,0.524887,0.908261,0.908261
4,0.32735,0.531419,0.225345,0.676727,0.215188,0.758685,0.341441,0.762697,0.178691,0.380037,...,0.219563,0.671909,0.357985,0.106977,0.894737,0.404762,0.717949,0.520362,0.201611,0.201611


In [14]:
x = features_df
y = heating_target

In [15]:
# Now we split the dataset into training and testing sets. Recall that we
# segmented the features and target variable earlier.

x_train, x_test, y_train, y_test = train_test_split(
    features_df, heating_target, test_size=0.3, random_state=42)

In [16]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((13814, 26), (5921, 26), (13814,), (5921,))
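The row counts above follow from the 70-30 split of the 19735 rows. A short sketch of the arithmetic, assuming the fractional test share is rounded up to a whole row (which matches the shapes shown):

```python
import math

n_rows = 19735      # rows reported by E_df.info()
test_size = 0.3     # share of rows held out for testing

# 0.3 * 19735 = 5920.5, rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```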

In [17]:
#fit the model to the training dataset
linear_model.fit(x_train, y_train)

LinearRegression()

In [18]:
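# Question 12
# From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the R^2 value in two d.p.?
# (This single-feature model is the one fitted in In [6] and scored in In [8] above.)

# Making predictions on the test set with the multiple regression model fitted in In [17]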
predicted_value = linear_model.predict(x_test)
predicted_value

array([0.03322207, 0.24411599, 0.03400024, ..., 0.06844707, 0.10032325,
       0.05722198])

In [19]:
# Question 13
# Normalize the dataset using the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set. Answer the following questions:

# What is the Mean Absolute Error (in two decimal places)?

Mae = mean_absolute_error(y_test, predicted_value)
round(Mae,2)

0.05

In [20]:
# Question 14
# What is the Residual Sum of Squares (in two decimal places)?

rss = np.sum(np.square(y_test - predicted_value))
round(rss,2)

45.35

In [21]:
# Question 15
# What is the Root Mean Squared Error (in three decimal places)?

Ms_ = mean_squared_error(y_test, predicted_value)
round(np.sqrt(Ms_), 3)  # root mean squared error

0.088
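The RSS and RMSE reported above are two views of the same residuals: the mean squared error is the RSS divided by the number of test rows, and the RMSE is its square root. A quick consistency check using the rounded figures from the outputs above (so the result is only approximate):

```python
import math

rss = 45.35      # residual sum of squares reported above (rounded to 2 d.p.)
n_test = 5921    # number of rows in the test set

mse = rss / n_test        # mean squared error
rmse = math.sqrt(mse)     # root mean squared error
print(round(rmse, 3))
```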

In [22]:
# Question 16
# What is the Coefficient of Determination (in two decimal places)?

r2_score1 = r2_score(y_test ,  predicted_value)
round(r2_score1, 2)

0.15

In [24]:
ridge_reg = Ridge(alpha=0.4)
ridge_reg.fit(x_train, y_train)

Ridge(alpha=0.4)

In [25]:
# Question 18
# Train a ridge regression model with an alpha value of 0.4. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?


ridge_predictions = ridge_reg.predict(x_test)
Ms_ = mean_squared_error(y_test, ridge_predictions)
round(np.sqrt(Ms_), 3)

0.088

In [26]:
lasso_reg = Lasso(alpha=0.001)
lasso_reg.fit(x_train, y_train)

Lasso(alpha=0.001)

In [27]:
# Comparing the effects of regularisation
def get_weights_df(model, feat, col_name):
    """Return a DataFrame of each feature's weight, sorted in ascending order."""
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    weights_df = pd.DataFrame(weights).reset_index()
    weights_df.columns = ['Features', col_name]
    # Weights are returned unrounded; apply .round(3) to the result
    # if a shorter display is wanted.
    return weights_df

linear_model_weights = get_weights_df(linear_model, x_train, 'Linear_Model_Weight')
ridge_weights_df = get_weights_df(ridge_reg, x_train, 'Ridge_Weight')
lasso_weights_df = get_weights_df(lasso_reg, x_train, 'Lasso_weight')
final_weights = pd.merge(linear_model_weights, ridge_weights_df, on='Features')
final_weights = pd.merge(final_weights, lasso_weights_df, on='Features')

In [28]:
# Question 17
# Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?
# final_weights1 = pd.merge(linear_model_weights, ridge_weights_df, on= 'Features' )
final_weights

Unnamed: 0,Features,Linear_Model_Weight,Ridge_Weight,Lasso_weight
0,RH_2,-0.456698,-0.411071,-0.0
1,T_out,-0.32186,-0.262172,0.0
2,T2,-0.236178,-0.201397,0.0
3,T9,-0.189941,-0.188916,-0.0
4,RH_8,-0.157595,-0.15683,-0.00011
5,RH_out,-0.077671,-0.054724,-0.049557
6,RH_7,-0.044614,-0.045977,-0.0
7,RH_9,-0.0398,-0.041367,-0.0
8,T5,-0.015657,-0.019853,-0.0
9,T1,-0.003281,-0.018406,0.0


In [29]:
# Question 19
# Train a lasso regression model with an alpha value of 0.001 and obtain the new feature
# weights with it. How many of the features have non-zero feature weights?


lasso_weights_df = get_weights_df(lasso_reg, x_train, 'Lasso_weight' )
lasso_weights_df

Unnamed: 0,Features,Lasso_weight
0,RH_out,-0.049557
1,RH_8,-0.00011
2,T1,0.0
3,Tdewpoint,0.0
4,Visibility,0.0
5,Press_mm_hg,-0.0
6,T_out,0.0
7,RH_9,-0.0
8,T9,-0.0
9,T8,0.0


In [30]:
# Question 20
# What is the new RMSE with the lasso regression? (Answer should be in three (3) decimal places)

lasso_predictions = lasso_reg.predict(x_test)
Ms_ = mean_squared_error(y_test, lasso_predictions)
round(np.sqrt(Ms_), 3)

0.094

In [31]:
final_weights

Unnamed: 0,Features,Linear_Model_Weight,Ridge_Weight,Lasso_weight
0,RH_2,-0.456698,-0.411071,-0.0
1,T_out,-0.32186,-0.262172,0.0
2,T2,-0.236178,-0.201397,0.0
3,T9,-0.189941,-0.188916,-0.0
4,RH_8,-0.157595,-0.15683,-0.00011
5,RH_out,-0.077671,-0.054724,-0.049557
6,RH_7,-0.044614,-0.045977,-0.0
7,RH_9,-0.0398,-0.041367,-0.0
8,T5,-0.015657,-0.019853,-0.0
9,T1,-0.003281,-0.018406,0.0


In [32]:
linear_model.coef_

array([-0.00328105,  0.5535466 , -0.23617792, -0.45669795,  0.29062714,
        0.09604827,  0.028981  ,  0.02638578, -0.01565684,  0.01600579,
        0.23642491,  0.03804865,  0.01031878, -0.04461364,  0.10199505,
       -0.15759548, -0.18994077, -0.03980032, -0.32185967,  0.00683933,
       -0.07767065,  0.02918313,  0.01230661,  0.11775773,  0.0007701 ,
        0.0007701 ])
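The raw array is easier to read once each coefficient is paired with its feature name, which also answers Question 17 directly. The sketch below copies the coefficients from the output above; the column order in `feature_names` is an assumption reconstructed from the attribute list and the table headers (the middle columns are elided in the previews), although the values do line up with the weights table for the features shown there.

```python
import numpy as np
import pandas as pd

# Assumed column order of x_train (26 features); reconstructed, not printed in full above.
feature_names = [
    "T1", "RH_1", "T2", "RH_2", "T3", "RH_3", "T4", "RH_4", "T5", "RH_5",
    "T6", "RH_6", "T7", "RH_7", "T8", "RH_8", "T9", "RH_9", "T_out",
    "Press_mm_hg", "RH_out", "Windspeed", "Visibility", "Tdewpoint",
    "rv1", "rv2",
]

# Coefficients copied from the linear_model.coef_ output above.
coef = np.array([
    -0.00328105, 0.5535466, -0.23617792, -0.45669795, 0.29062714,
    0.09604827, 0.028981, 0.02638578, -0.01565684, 0.01600579,
    0.23642491, 0.03804865, 0.01031878, -0.04461364, 0.10199505,
    -0.15759548, -0.18994077, -0.03980032, -0.32185967, 0.00683933,
    -0.07767065, 0.02918313, 0.01230661, 0.11775773, 0.0007701,
    0.0007701,
])

# Label each weight, then look up the smallest and largest.
weights = pd.Series(coef, index=feature_names)
lowest, highest = weights.idxmin(), weights.idxmax()
print(lowest, highest)
```

With the objects in this notebook, the same series should come from `pd.Series(linear_model.coef_, index=x_train.columns)` without any assumption about the order.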

### THE END!