# Machine Learning: Regression - Predicting Energy Efficiency of Buildings

**Dataset Information** (**__[Dataset Source](https://drive.google.com/file/d/1Eru_UHVc3WLHVveC9Q8K9QUxlzYeHt18/view?usp=share_link)__**)

The dataset for the remainder of this quiz (from question 18) is the Appliances Energy Prediction data. The data were recorded at 10-minute intervals for about 4.5 months. House temperature and humidity were monitored with a ZigBee wireless sensor network. Each wireless node transmitted temperature and humidity readings roughly every 3.3 minutes, and the readings were then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data using the date and time column. Two random variables are included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters). The attribute information is listed below.

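The averaging and merging described above can be sketched with pandas. This is only an illustration on made-up readings: the 200-second spacing, the values, and the names `raw`, `weather`, and `merged` are assumptions, not the original authors' pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sensor readings, one every 200 seconds (about 3.3 minutes).
raw_index = pd.date_range("2016-01-11 17:00:00", periods=18, freq="200s", name="date")
raw = pd.DataFrame({"T1": np.linspace(19.0, 20.7, 18)}, index=raw_index)

# Average the raw readings over 10-minute periods.
sensor_10min = raw.resample("10min").mean()

# Hypothetical weather table already on the same 10-minute grid.
weather = pd.DataFrame(
    {
        "date": pd.date_range("2016-01-11 17:00:00", periods=6, freq="10min"),
        "T_out": [6.6, 6.5, 6.4, 6.3, 6.1, 6.0],
    }
)

# Merge the two tables on the shared date-time column.
merged = sensor_10min.reset_index().merge(weather, on="date")
print(merged)
```

Each 10-minute row of `merged` holds the mean of the three raw readings that fall inside that period, next to the weather value for the same timestamp.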

**Attribute Information**

date, date and time in year-month-day hour:minute:second format

Appliances, energy use in Wh 

lights, energy use of light fixtures in the house in Wh

T1, Temperature in kitchen area, in Celsius

RH_1, Humidity in kitchen area, in %

T2, Temperature in living room area, in Celsius

RH_2, Humidity in living room area, in %

T3, Temperature in laundry room area, in Celsius

RH_3, Humidity in laundry room area, in %

T4, Temperature in office room, in Celsius

RH_4, Humidity in office room, in %

T5, Temperature in bathroom, in Celsius

RH_5, Humidity in bathroom, in %

T6, Temperature outside the building (north side), in Celsius

RH_6, Humidity outside the building (north side), in %

T7, Temperature in ironing room, in Celsius

RH_7, Humidity in ironing room, in %

T8, Temperature in teenager room 2, in Celsius

RH_8, Humidity in teenager room 2, in %

T9, Temperature in parents room, in Celsius

RH_9, Humidity in parents room, in %

T_out, Temperature outside (from Chievres weather station), in Celsius

Pressure (from Chievres weather station), in mm Hg

RH_out, Humidity outside (from Chievres weather station), in %

Wind speed (from Chievres weather station), in m/s

Visibility (from Chievres weather station), in km

Tdewpoint (from Chievres weather station), in °C

rv1, Random variable 1, nondimensional

rv2, Random variable 2, nondimensional

In [1]:
#importing libraries for preprocessing, visualisation, machine learning and ignore warnings
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)

In [2]:
# Reading dataset
df=pd.read_csv('energydata_complete.csv')

In [3]:
# taking a look at the dataset
df.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

In [5]:
# looking through dataset for NaN entries
df.isnull().sum()

date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

In [6]:
# statistical details about features
df.describe()

Unnamed: 0,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
count,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,...,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0
mean,97.694958,3.801875,21.686571,40.259739,20.341219,40.42042,22.267611,39.2425,20.855335,39.026904,...,19.485828,41.552401,7.411665,755.522602,79.750418,4.039752,38.330834,3.760707,24.988033,24.988033
std,102.524891,7.935988,1.606066,3.979299,2.192974,4.069813,2.006111,3.254576,2.042884,4.341321,...,2.014712,4.151497,5.317409,7.399441,14.901088,2.451221,11.794719,4.194648,14.496634,14.496634
min,10.0,0.0,16.79,27.023333,16.1,20.463333,17.2,28.766667,15.1,27.66,...,14.89,29.166667,-5.0,729.3,24.0,0.0,1.0,-6.6,0.005322,0.005322
25%,50.0,0.0,20.76,37.333333,18.79,37.9,20.79,36.9,19.53,35.53,...,18.0,38.5,3.666667,750.933333,70.333333,2.0,29.0,0.9,12.497889,12.497889
50%,60.0,0.0,21.6,39.656667,20.0,40.5,22.1,38.53,20.666667,38.4,...,19.39,40.9,6.916667,756.1,83.666667,3.666667,40.0,3.433333,24.897653,24.897653
75%,100.0,0.0,22.6,43.066667,21.5,43.26,23.29,41.76,22.1,42.156667,...,20.6,44.338095,10.408333,760.933333,91.666667,5.5,40.0,6.566667,37.583769,37.583769
max,1080.0,70.0,26.26,63.36,29.856667,56.026667,29.236,50.163333,26.2,51.09,...,24.5,53.326667,26.1,772.3,100.0,14.0,66.0,15.5,49.99653,49.99653


In [7]:
# shape of dataset
df.shape

(19735, 29)

## Questions


From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the Root Mean Squared Error, to three decimal places?

In [8]:
# Assign the feature X (T2) and the target y (T6) as column vectors
X = df['T2'].to_numpy().reshape(-1, 1)
y = df['T6'].to_numpy().reshape(-1, 1)

# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and fit a linear regression model
linear_model = LinearRegression()
linear_model.fit(x_train, y_train)

# Obtain predictions for the test set
predicted_values = linear_model.predict(x_test)

# Compute the RMSE on the test set
rmse = round(float(np.sqrt(mean_squared_error(y_test, predicted_values))), 3)
print("The Root Mean Squared Error is:", rmse)

The Root Mean Squared Error is: 3.63

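As a reminder of what the metric measures, RMSE is the square root of the mean squared residual. The sketch below computes it by hand with NumPy on toy values that are not taken from the dataset; the helper name `rmse_manual` is made up for illustration.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not taken from the dataset.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

# Residuals are 1, 0, -1, -2, so the mean squared residual is 1.5.
toy_rmse = rmse_manual(y_true, y_pred)
print(round(toy_rmse, 3))
```

Because the residuals are squared before averaging, RMSE is expressed in the same unit as the target (here, degrees Celsius for T6) and weights large misses more heavily than small ones.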


Remove the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test split with a random state of 42 (for reproducibility).
Normalize the dataset using the MinMaxScaler (Hint: use the MinMaxScaler fit_transform and transform methods on the train and test sets respectively).
Run a multiple linear regression using the training and test sets:

In [9]:
# Drop the features that are not needed
new = df.drop(['lights', 'date'], axis=1)

# Normalize every column, target included, to the [0, 1] range with MinMaxScaler.
# Note: the scaler is fit on the full dataset here, before the train-test split.
scaler = MinMaxScaler()
norm = pd.DataFrame(scaler.fit_transform(new), columns=new.columns)

# Assign the features X and the target y
X = norm.drop(['Appliances'], axis=1)
y = norm['Appliances']

# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a linear regression model and fit it to the training set
linear_model = LinearRegression()
linear_model.fit(x_train, y_train)

# Obtain predictions for the test and training sets
pred_test = linear_model.predict(x_test)
pred_train = linear_model.predict(x_train)

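The cell above scales the whole table before splitting, whereas the hint in the question suggests `fit_transform` on the training set and `transform` on the test set. A minimal sketch of that alternative follows, on synthetic data with placeholder column names (`f1`, `f2`, `f3`, `target`); it leaves the target unscaled, so it would not reproduce the numbers reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature table; column names are placeholders.
rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
target = pd.Series(rng.normal(size=200), name="target")

# Split first, so the test rows never influence the scaler.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)

scaler = MinMaxScaler()

# Learn each column's min and max from the training rows only.
x_train_scaled = pd.DataFrame(
    scaler.fit_transform(x_train), columns=x_train.columns, index=x_train.index
)

# Reuse the training min and max; test values may fall slightly outside [0, 1].
x_test_scaled = pd.DataFrame(
    scaler.transform(x_test), columns=x_test.columns, index=x_test.index
)
```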
What are the Mean Absolute Error and Root Mean Squared Error (to three decimal places) for the train and test sets?


In [10]:
# Function that prints the MAE and RMSE for a set of true and predicted values
import sklearn.metrics as metrics

def regression_results(value, predicted_values):
    # Regression metrics
    mae = metrics.mean_absolute_error(value, predicted_values)
    mse = metrics.mean_squared_error(value, predicted_values)

    print('Mean Absolute Error is:', round(mae, 3))
    print('Root Mean Squared Error is:', round(np.sqrt(mse), 3))

In [11]:
print("For Train set\n")
regression_results(y_train,pred_train)

For Train set

Mean Absolute Error is: 0.05
Root Mean Squared Error is: 0.089


In [12]:
print("For Test set\n")
regression_results(y_test,pred_test)

For Test set

Mean Absolute Error is: 0.05
Root Mean Squared Error is: 0.088

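Because the target was min-max scaled, these errors are fractions of the Appliances range rather than watt-hours. A rough conversion back to Wh, using the minimum (10) and maximum (1080) from the `df.describe()` output and the rounded test-set errors above, might look like this; the variable names are illustrative.

```python
# Range of Appliances from the df.describe() output: min 10 Wh, max 1080 Wh.
appliances_min = 10.0
appliances_max = 1080.0
target_range = appliances_max - appliances_min  # 1070 Wh

# Test-set errors reported above, in min-max-scaled units (already rounded).
mae_scaled = 0.05
rmse_scaled = 0.088

# Min-max scaling divides by the range, so multiplying an error by the range undoes it.
mae_wh = mae_scaled * target_range
rmse_wh = rmse_scaled * target_range

print(f"MAE  is roughly {mae_wh:.1f} Wh")
print(f"RMSE is roughly {rmse_wh:.1f} Wh")
```

Since the inputs are rounded to three decimal places, the results (about 54 Wh and 94 Wh) are approximations only.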

Did the model above overfit to the training set?

In [13]:
# Compare the R² score of the linear model on the training and test sets to check for overfitting
print("Training set score: {:.2f}".format(linear_model.score(x_train, y_train)))
print("Test set score: {:.2f}".format(linear_model.score(x_test, y_test)))

print("No, the model did not overfit.")

Training set score: 0.14
Test set score: 0.15
No, the model did not overfit.

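For a regressor, `score` returns the coefficient of determination R². The sketch below computes R² by hand on toy values and then looks at the gap between the two scores reported above; the helper `r2_manual` and the toy arrays are illustrative only.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values, not taken from the dataset.
toy_r2 = r2_manual([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5])
print(round(toy_r2, 3))

# Scores reported above for the linear model.
train_r2, test_r2 = 0.14, 0.15

# Overfitting would show up as a training score well above the test score.
gap = train_r2 - test_r2
print("train minus test R^2:", round(gap, 2))
```

Here the test score is not below the training score, which supports the "no overfitting" answer. Both scores are low, though, so the linear model explains only a small share of the variance in the target.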

Train a ridge regression model with default parameters. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?

In [14]:
# Create a ridge regression model with default parameters, fit it, and predict on x_test
ridge = Ridge()
ridge.fit(x_train, y_train)
ridge_pred = ridge.predict(x_test)

# Evaluate the RMSE on the test set
rmse = np.sqrt(mean_squared_error(y_test, ridge_pred))

print('rmse is', round(rmse, 3), "\nAns: No")

rmse is 0.088 
Ans: No

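`Ridge()` uses `alpha=1.0` by default, and a penalty of that size may be too weak to change the RMSE at three decimal places. One way to see how the penalty strength matters is to sweep `alpha`; the sketch below does so on synthetic data from `make_regression`, so its numbers say nothing about the energy dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real data.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def rmse_on_test(model):
    """Fit on the training split and return the RMSE on the test split."""
    model.fit(x_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(x_test))))

# Plain least squares first, then ridge with increasingly strong penalties.
results = {"linear": rmse_on_test(LinearRegression())}
for alpha in (0.1, 1.0, 10.0, 100.0):
    results[f"ridge(alpha={alpha})"] = rmse_on_test(Ridge(alpha=alpha))

for name, value in results.items():
    print(f"{name:>20}: {value:.3f}")
```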


Train a lasso regression model with default parameters and obtain the new feature weights with it. How many of the features have non-zero feature weights?

In [15]:
# Create a lasso regression model with default parameters, fit it, and predict on x_test
lasso_reg = Lasso()
lasso_reg.fit(x_train, y_train)
lasso_pred = lasso_reg.predict(x_test)

# Count the features with non-zero weights
print("Features with Non-zero weights equals:", np.sum(lasso_reg.coef_ != 0))

Features with Non-zero weights equals: 0

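A count of zero is what one would expect at the default `alpha=1.0` when features and target all lie in [0, 1]. scikit-learn's Lasso minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, so every weight is zero once `alpha` reaches the largest absolute covariance between a feature and the target, and for variables bounded in [0, 1] that covariance cannot exceed 0.25. The sketch below checks this on synthetic data; the data and the names `alpha_max`, `count_default`, and `count_small` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: three informative features and two irrelevant ones.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 5))
y_raw = X_raw @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=500)

# Scale features and target to [0, 1], mirroring the notebook's preprocessing.
X = MinMaxScaler().fit_transform(X_raw)
y = MinMaxScaler().fit_transform(y_raw.reshape(-1, 1)).ravel()

# Smallest alpha that zeroes every weight (with the default fit_intercept=True).
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = float(np.max(np.abs(Xc.T @ yc)) / len(y))
print("alpha_max:", round(alpha_max, 4))

# Default penalty versus a penalty well below alpha_max.
count_default = int(np.sum(Lasso(alpha=1.0).fit(X, y).coef_ != 0))
count_small = int(np.sum(Lasso(alpha=alpha_max / 10).fit(X, y).coef_ != 0))
print("non-zero weights at alpha=1.0:", count_default)
print("non-zero weights at alpha_max/10:", count_small)
```

Lowering `alpha` below `alpha_max` is what lets some weights become non-zero; the quiz question, however, asks about the default value.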

What is the new RMSE with the Lasso Regression on the test set?

In [16]:
#Finding RMSE for Lasso Regression
rmse=np.sqrt(mean_squared_error(y_test, lasso_pred))

print('RMSE for Lasso Regression is:',round(rmse,3))

RMSE for Lasso Regression is: 0.095

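With every weight at zero, the lasso model predicts the training-set mean of the target for every row, so its RMSE should match that of a mean-only baseline. The sketch below illustrates this on synthetic data already in [0, 1]; it is not the notebook's data, and the variable names are made up.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and target, already in the [0, 1] range.
rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 4))
y = rng.uniform(size=400)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Default Lasso (alpha=1.0) shrinks every weight to zero on data of this scale.
lasso = Lasso().fit(x_train, y_train)
lasso_rmse = float(np.sqrt(mean_squared_error(y_test, lasso.predict(x_test))))

# Baseline that always predicts the training mean of the target.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = float(np.sqrt(mean_squared_error(y_test, baseline_pred)))

print("lasso RMSE:   ", round(lasso_rmse, 3))
print("baseline RMSE:", round(baseline_rmse, 3))
```

Read this way, the higher lasso RMSE reported above reflects a model that ignores all features, not a better-regularized fit.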

## END OF NOTEBOOK