# Appliances Energy Prediction

The dataset for the remainder of this quiz is the Appliances Energy Prediction data. The data was recorded at 10-minute intervals for about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted temperature and humidity readings roughly every 3.3 minutes, and the wireless data was then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and filtering out non-predictive attributes (parameters). The attribute information is listed below.
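As a rough illustration of the 10-minute cadence described above, the sketch below checks that consecutive timestamps are exactly 10 minutes apart and converts a row count into days. The `dates` series is a hypothetical stand-in built with `pd.date_range`, not the real `date` column; the row count of 19,735 is the one reported by `energy_df.shape` later in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the 'date' column (not read from the real CSV)
dates = pd.Series(pd.date_range("2016-01-11 17:00:00", periods=6, freq="10min"))

# Gaps between consecutive timestamps (the first diff is NaT, so drop it)
steps = dates.diff().dropna()

# True only if every gap is exactly 10 minutes
is_regular = (steps == pd.Timedelta(minutes=10)).all()
print(is_regular)

# 19,735 rows at 10 minutes each, expressed in days
days = 19735 * 10 / (60 * 24)
print(round(days, 2))  # about 137 days, i.e. roughly 4.5 months
```

On the real data, the same check could be applied to `pd.to_datetime(energy_df['date'])` before the column is dropped.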

Attribute Information:

date, time in year-month-day hour:minute:second format

Appliances, energy use in Wh

lights, energy use of light fixtures in the house in Wh

T1, Temperature in kitchen area, in Celsius

RH_1, Humidity in kitchen area, in %

T2, Temperature in living room area, in Celsius

RH_2, Humidity in living room area, in %

T3, Temperature in laundry room area, in Celsius

RH_3, Humidity in laundry room area, in %

T4, Temperature in office room, in Celsius

RH_4, Humidity in office room, in %

T5, Temperature in bathroom, in Celsius

RH_5, Humidity in bathroom, in %

T6, Temperature outside the building (north side), in Celsius

RH_6, Humidity outside the building (north side), in %

T7, Temperature in ironing room, in Celsius

RH_7, Humidity in ironing room, in %

T8, Temperature in teenager room 2, in Celsius

RH_8, Humidity in teenager room 2, in %

T9, Temperature in parents room, in Celsius

RH_9, Humidity in parents room, in %

T_out, Temperature outside (from Chievres weather station), in Celsius

Pressure (from Chievres weather station), in mm Hg

RH_out, Humidity outside (from Chievres weather station), in %

Wind speed (from Chievres weather station), in m/s

Visibility (from Chievres weather station), in km

Tdewpoint (from Chievres weather station), in °C

rv1, Random variable 1, nondimensional

rv2, Random variable 2, nondimensional

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [2]:
# load the dataset into a pandas dataframe
energy_df = pd.read_csv('energydata_complete.csv')
energy_df.head()

| | date | Appliances | lights | T1 | RH_1 | T2 | RH_2 | T3 | RH_3 | T4 | ... | T9 | RH_9 | T_out | Press_mm_hg | RH_out | Windspeed | Visibility | Tdewpoint | rv1 | rv2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-11 17:00:00 | 60 | 30 | 19.89 | 47.596667 | 19.2 | 44.79 | 19.79 | 44.73 | 19.0 | ... | 17.033333 | 45.53 | 6.6 | 733.5 | 92.0 | 7.0 | 63.0 | 5.3 | 13.275433 | 13.275433 |
| 1 | 2016-01-11 17:10:00 | 60 | 30 | 19.89 | 46.693333 | 19.2 | 44.7225 | 19.79 | 44.79 | 19.0 | ... | 17.066667 | 45.56 | 6.483333 | 733.6 | 92.0 | 6.666667 | 59.166667 | 5.2 | 18.606195 | 18.606195 |
| 2 | 2016-01-11 17:20:00 | 50 | 30 | 19.89 | 46.3 | 19.2 | 44.626667 | 19.79 | 44.933333 | 18.926667 | ... | 17.0 | 45.5 | 6.366667 | 733.7 | 92.0 | 6.333333 | 55.333333 | 5.1 | 28.642668 | 28.642668 |
| 3 | 2016-01-11 17:30:00 | 50 | 40 | 19.89 | 46.066667 | 19.2 | 44.59 | 19.79 | 45.0 | 18.89 | ... | 17.0 | 45.4 | 6.25 | 733.8 | 92.0 | 6.0 | 51.5 | 5.0 | 45.410389 | 45.410389 |
| 4 | 2016-01-11 17:40:00 | 60 | 40 | 19.89 | 46.333333 | 19.2 | 44.53 | 19.79 | 45.0 | 18.89 | ... | 17.0 | 45.4 | 6.133333 | 733.9 | 92.0 | 5.666667 | 47.666667 | 4.9 | 10.084097 | 10.084097 |


In [3]:
# print the dataframe shape and the data type of each column
print(energy_df.shape)
energy_df.dtypes

(19735, 29)


date            object
Appliances       int64
lights           int64
T1             float64
RH_1           float64
T2             float64
RH_2           float64
T3             float64
RH_3           float64
T4             float64
RH_4           float64
T5             float64
RH_5           float64
T6             float64
RH_6           float64
T7             float64
RH_7           float64
T8             float64
RH_8           float64
T9             float64
RH_9           float64
T_out          float64
Press_mm_hg    float64
RH_out         float64
Windspeed      float64
Visibility     float64
Tdewpoint      float64
rv1            float64
rv2            float64
dtype: object

In [4]:
energy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

In [5]:
# check for null values
energy_df.isnull().sum()

date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

In [6]:
# Checking for duplicates
energy_df.duplicated().sum()

0

- Drop the date and lights columns

In [7]:
energy_df = energy_df.drop(['date', 'lights'], axis=1)

In [8]:
energy_df.head()

| | Appliances | T1 | RH_1 | T2 | RH_2 | T3 | RH_3 | T4 | RH_4 | T5 | ... | T9 | RH_9 | T_out | Press_mm_hg | RH_out | Windspeed | Visibility | Tdewpoint | rv1 | rv2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 19.89 | 47.596667 | 19.2 | 44.79 | 19.79 | 44.73 | 19.0 | 45.566667 | 17.166667 | ... | 17.033333 | 45.53 | 6.6 | 733.5 | 92.0 | 7.0 | 63.0 | 5.3 | 13.275433 | 13.275433 |
| 1 | 60 | 19.89 | 46.693333 | 19.2 | 44.7225 | 19.79 | 44.79 | 19.0 | 45.9925 | 17.166667 | ... | 17.066667 | 45.56 | 6.483333 | 733.6 | 92.0 | 6.666667 | 59.166667 | 5.2 | 18.606195 | 18.606195 |
| 2 | 50 | 19.89 | 46.3 | 19.2 | 44.626667 | 19.79 | 44.933333 | 18.926667 | 45.89 | 17.166667 | ... | 17.0 | 45.5 | 6.366667 | 733.7 | 92.0 | 6.333333 | 55.333333 | 5.1 | 28.642668 | 28.642668 |
| 3 | 50 | 19.89 | 46.066667 | 19.2 | 44.59 | 19.79 | 45.0 | 18.89 | 45.723333 | 17.166667 | ... | 17.0 | 45.4 | 6.25 | 733.8 | 92.0 | 6.0 | 51.5 | 5.0 | 45.410389 | 45.410389 |
| 4 | 60 | 19.89 | 46.333333 | 19.2 | 44.53 | 19.79 | 45.0 | 18.89 | 45.53 | 17.2 | ... | 17.0 | 45.4 | 6.133333 | 733.9 | 92.0 | 5.666667 | 47.666667 | 4.9 | 10.084097 | 10.084097 |


## Measuring Regression Performance

In [9]:
# First, normalise the dataset to a common [0, 1] scale using the min-max scaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

energy_normalised_df = pd.DataFrame(scaler.fit_transform(energy_df), columns=energy_df.columns)

# separate the features (x) from the target (y)
x = energy_normalised_df.drop(columns=['Appliances'])
y = energy_normalised_df['Appliances']

In [10]:
#Now, we split our dataset into the training and testing dataset.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
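The cells above scale the full dataset before splitting, which lets the scaler see the test rows' minimum and maximum. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. This is an illustration under stated assumptions, not the approach that produced the results in this notebook: the arrays `X` and `y` are made up, and applying this order to the real data would likely change the reported numbers slightly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the energy dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)

# Split first, so the test rows stay unseen during scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

# Training features span [0, 1]; test features may fall slightly outside it
print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.min(), X_test_scaled.max())
```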

### Linear Model

In [11]:
from sklearn import linear_model

lin_reg = linear_model.LinearRegression()

#fit the model to the training dataset
lin_reg.fit(x_train, y_train)

#obtain predictions
predicted_values = lin_reg.predict(x_test)

In [12]:
# checking linear training and test set score
round(lin_reg.score(x_train, y_train), 3), round(lin_reg.score(x_test, y_test), 3)

(0.145, 0.149)
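For a scikit-learn regressor, `score` returns the coefficient of determination, which is why the test score above matches the R-Squared value computed later. The sketch below checks this equivalence on synthetic data; the arrays and coefficients are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with a known linear signal plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)

# Both expressions compute R^2 on the same data
score_from_method = model.score(X, y)
score_from_metric = r2_score(y, model.predict(X))

print(np.isclose(score_from_method, score_from_metric))
```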

#### Mean Absolute Error (MAE)

In [13]:
#MAE
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, predicted_values)
round(mae, 3)

0.05

#### Residual Sum of Squares (RSS) 

In [14]:
rss = np.sum(np.square(y_test - predicted_values))
round(rss, 3)

45.348

#### Root Mean Square Error (RMSE)

In [15]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, predicted_values))
round(rmse, 3)

0.088

#### R-Squared

In [16]:
from sklearn.metrics import r2_score

# store the result under a separate name so the imported r2_score function is not shadowed
r2 = r2_score(y_test, predicted_values)
round(r2, 3)

0.149
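The four metrics above are closely related: RMSE is the square root of RSS divided by the number of test rows, and R-Squared is one minus RSS over the total sum of squares. The sketch below computes all four by hand with NumPy on a tiny made-up example (the `y_true` and `y_pred` arrays are assumptions, not values from the notebook).

```python
import numpy as np

# Tiny made-up example (not taken from the energy data)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))            # 0.625
rss = np.sum(residuals ** 2)                # 2.25
rmse = np.sqrt(rss / len(y_true))           # 0.75
tss = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares, 14.75
r2 = 1 - rss / tss                          # about 0.847

print(mae, rss, rmse, round(r2, 3))
```

As a consistency check on the results above, and assuming the 30% test split holds 5,921 of the 19,735 rows, sqrt(45.348 / 5921) is approximately 0.088, which agrees with the reported RMSE.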

## Penalization

### Ridge Regression 

In [17]:
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=0.4)
ridge_reg.fit(x_train, y_train)

#obtain predictions
ridge_pred = ridge_reg.predict(x_test)

In [18]:
# checking ridge training and test set score
round(ridge_reg.score(x_train, y_train), 3), round(ridge_reg.score(x_test, y_test), 3)

(0.145, 0.149)
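With `alpha=0.4` the ridge scores are essentially the same as the unpenalised model, which suggests the penalty is weak at this setting. To show what the penalty does as it grows, the sketch below fits `Ridge` at several `alpha` values on synthetic data and records the L2 norm of the coefficients. The data and the chosen `alphas` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

alphas = [0.01, 1.0, 100.0]

# A larger alpha means a stronger L2 penalty and a smaller coefficient norm
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas]

for a, n in zip(alphas, norms):
    print(a, round(n, 3))
```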

#### Mean Absolute Error (MAE)

In [19]:
#MAE
mae = mean_absolute_error(y_test, ridge_pred)
round(mae, 3)

0.05

#### Residual Sum of Squares (RSS) 

In [20]:
rss = np.sum(np.square(y_test - ridge_pred))
round(rss, 3)

45.368

#### Root Mean Square Error (RMSE)

In [21]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, ridge_pred))
round(rmse, 3)

0.088

#### R-Squared

In [22]:
from sklearn.metrics import r2_score

# store the result under a separate name so the imported r2_score function is not shadowed
r2 = r2_score(y_test, ridge_pred)
round(r2, 3)

0.149

### Feature Selection and Lasso Regression 

In [23]:
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.001)
lasso_reg.fit(x_train, y_train)

#obtain predictions
lasso_pred = lasso_reg.predict(x_test)

In [24]:
# checking lasso training and test set score
round(lasso_reg.score(x_train, y_train), 3), round(lasso_reg.score(x_test, y_test), 3)

(0.025, 0.027)

#### Mean Absolute Error (MAE)

In [25]:
#MAE
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, lasso_pred)
round(mae, 3)

0.055

#### Residual Sum of Squares (RSS) 

In [26]:
rss = np.sum(np.square(y_test - lasso_pred))
round(rss, 3)

51.853

#### Root Mean Square Error (RMSE)

In [27]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, lasso_pred))
round(rmse, 3)

0.094

#### R-Squared

In [28]:
from sklearn.metrics import r2_score

# store the result under a separate name so the imported r2_score function is not shadowed
r2 = r2_score(y_test, lasso_pred)
round(r2, 3)

0.027

### Regression weight
Note: This function is derived from Hamoye Course content

In [29]:
# comparing the effects of regularisation
def get_weights_df(model, feat, col_name):
    # return a dataframe with the weight of every feature, sorted in ascending order
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    weights_df = pd.DataFrame(weights).reset_index()
    weights_df.columns = ['Features', col_name]
    # weights are returned unrounded
    return weights_df

In [30]:
linear_model_weights = get_weights_df(lin_reg, x_train, 'Linear_Model_Weight')
ridge_weights_df = get_weights_df(ridge_reg, x_train, 'Ridge_Weight')
lasso_weights_df = get_weights_df(lasso_reg, x_train, 'Lasso_weight')

final_weights = pd.merge(linear_model_weights, ridge_weights_df, on='Features')
final_weights = pd.merge(final_weights, lasso_weights_df, on='Features')

In [31]:
final_weights

| | Features | Linear_Model_Weight | Ridge_Weight | Lasso_weight |
|---|---|---|---|---|
| 0 | RH_2 | -0.456698 | -0.411071 | -0.0 |
| 1 | T_out | -0.32186 | -0.262172 | 0.0 |
| 2 | T2 | -0.236178 | -0.201397 | 0.0 |
| 3 | T9 | -0.189941 | -0.188916 | -0.0 |
| 4 | RH_8 | -0.157595 | -0.15683 | -0.00011 |
| 5 | RH_out | -0.077671 | -0.054724 | -0.049557 |
| 6 | RH_7 | -0.044614 | -0.045977 | -0.0 |
| 7 | RH_9 | -0.0398 | -0.041367 | -0.0 |
| 8 | T5 | -0.015657 | -0.019853 | -0.0 |
| 9 | T1 | -0.003281 | -0.018406 | 0.0 |
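In the weights above, most Lasso coefficients are exactly zero while the linear and ridge weights are merely small. This is the feature-selection effect of the L1 penalty. The sketch below reproduces it on synthetic data where only two of five features carry signal; the data and `alpha=0.5` are assumptions chosen to make the effect visible, not the settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only the first two of five features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero in each model
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print(n_zero_ols, n_zero_lasso)
print(np.round(lasso.coef_, 3))
```

On the notebook's own models, the same count could be taken with `np.sum(lasso_reg.coef_ == 0)`.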


From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the R^2 value to two decimal places?

In [32]:
# select the two temperature columns from the normalised dataset
tem_reg_df = energy_normalised_df[['T2', 'T6']]

tem_reg_df.head()

| | T2 | T6 |
|---|---|---|
| 0 | 0.225345 | 0.38107 |
| 1 | 0.225345 | 0.375443 |
| 2 | 0.225345 | 0.367487 |
| 3 | 0.225345 | 0.3638 |
| 4 | 0.225345 | 0.361859 |


In [33]:
# reshape each column into a 2-D array of shape (n_samples, 1), as scikit-learn expects for features
tem_x = tem_reg_df['T2'].values.reshape(-1,1)
tem_y = tem_reg_df['T6'].values.reshape(-1,1)

In [34]:
#split sample dataset into train and test sets
xtrain, xtest, ytrain, ytest = train_test_split(tem_x, tem_y, test_size=0.3, random_state=42)

In [35]:
#linear model
lin_regr = linear_model.LinearRegression()

# Train the model using the training sets
lin_regr.fit(xtrain, ytrain)

# Make predictions using the testing set
pred = lin_regr.predict(xtest)

In [36]:
#R-squared or Coefficient of determination
from sklearn.metrics import r2_score

# store the result under a separate name so the imported r2_score function is not shadowed
r2 = r2_score(ytest, pred)
round(r2, 2)

0.64
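For a linear model with a single feature and an intercept, the R^2 computed on the data used for fitting equals the squared Pearson correlation between x and y. The sketch below checks this on synthetic data (the arrays are assumptions, not T2 and T6). On a held-out test set, as in the cell above, the two quantities are usually close but not identical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic single-feature data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 0.8 * x + rng.normal(scale=0.5, size=300)

X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
model = LinearRegression().fit(X, y)

# R^2 on the fitting data versus the squared Pearson correlation
r2_fit = r2_score(y, model.predict(X))
pearson_sq = np.corrcoef(x, y)[0, 1] ** 2

print(np.isclose(r2_fit, pearson_sq))
```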