# Stage B Quiz Solution

Oladimeji Williams
© ellipsis

---

I, **Oladimeji WILLIAMS**, confirm by submitting this document that the solutions in this notebook are the result of my own work and that I abide by the [Code of Conduct](https://drive.google.com/file/d/1sbR80aowp1daCnElwx3kNm0fxids0e6b/view) contained therein.


### Overview: Machine Learning: Regression - Predicting Energy Efficiency of Buildings
> The dataset for the remainder of this quiz is the Appliances Energy Prediction data. The data were recorded at 10-minute intervals for about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions roughly every 3.3 minutes, and the wireless data were then averaged over 10-minute periods. The energy data were logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) were downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters). The attribute information is listed below.

Attribute Information:
- 1.   `date`, time year-month-day hour:minute:second
- 2.   `Appliances`, energy use in Wh
- 3.   `lights`, energy use of light fixtures in the house in Wh
- 4.   `T1`, Temperature in kitchen area, in Celsius
- 5.   `RH_1`, Humidity in kitchen area, in %
- 6.   `T2`, Temperature in living room area, in Celsius
- 7.   `RH_2`, Humidity in living room area, in %
- 8.   `T3`, Temperature in laundry room area, in Celsius
- 9.   `RH_3`, Humidity in laundry room area, in %
- 10.  `T4`, Temperature in office room, in Celsius
- 11.  `RH_4`, Humidity in office room, in %
- 12.  `T5`, Temperature in bathroom, in Celsius
- 13.  `RH_5`, Humidity in bathroom, in %
- 14.  `T6`, Temperature outside the building (north side), in Celsius
- 15.  `RH_6`, Humidity outside the building (north side), in %
- 16.  `T7`, Temperature in ironing room, in Celsius
- 17.  `RH_7`, Humidity in ironing room, in %
- 18.  `T8`, Temperature in teenager room 2, in Celsius
- 19.  `RH_8`, Humidity in teenager room 2, in %
- 20.  `T9`, Temperature in parents room, in Celsius
- 21.  `RH_9`, Humidity in parents room, in %
- 22.  `T_out`, Temperature outside (from Chievres weather station), in Celsius
- 23.  `Press_mm_hg` (from Chievres weather station), in mm Hg
- 24.  `RH_out`, Humidity outside (from Chievres weather station), in %
- 25.  `Windspeed` (from Chievres weather station), in m/s
- 26.  `Visibility` (from Chievres weather station), in km
- 27.  `Tdewpoint` (from Chievres weather station), in °C
- 28.  `rv1`, Random variable 1, nondimensional
- 29.  `rv2`, Random variable 2, nondimensional

# Preliminaries

In [1]:
# Load All Possible Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from statsmodels.graphics.correlation import plot_corr
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
# Load the dataset
df = pd.read_csv(r"energydata_complete.csv", parse_dates=[0])

In [3]:
# Copy the dataset into another dataframe
df_copy = df.copy()

In [4]:
# Peek at the first few observations of the dataset
df.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [5]:
# Peek at the datatypes
df.dtypes

date           datetime64[ns]
Appliances              int64
lights                  int64
T1                    float64
RH_1                  float64
T2                    float64
RH_2                  float64
T3                    float64
RH_3                  float64
T4                    float64
RH_4                  float64
T5                    float64
RH_5                  float64
T6                    float64
RH_6                  float64
T7                    float64
RH_7                  float64
T8                    float64
RH_8                  float64
T9                    float64
RH_9                  float64
T_out                 float64
Press_mm_hg           float64
RH_out                float64
Windspeed             float64
Visibility            float64
Tdewpoint             float64
rv1                   float64
rv2                   float64
dtype: object

## Question 12

In [6]:
# Build the OLS formula: regress T6 (column 13) on T2 (column 5)
formula = f"{df.columns[13]} ~ {df.columns[5]}"
formula

'T6 ~ T2'

In [7]:
ols_model = smf.ols(formula=formula, data=df)
fitted = ols_model.fit()
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                     T6   R-squared:                       0.642
Model:                            OLS   Adj. R-squared:                  0.642
Method:                 Least Squares   F-statistic:                 3.537e+04
Date:                Tue, 09 Aug 2022   Prob (F-statistic):               0.00
Time:                        15:30:46   Log-Likelihood:                -53524.
No. Observations:               19735   AIC:                         1.071e+05
Df Residuals:                   19733   BIC:                         1.071e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -37.3495      0.242   -154.306      0.0

In [8]:
# Feature Selection for Linear Regression
X = df[["T2"]]
y = df["T6"]

In [9]:
# Split the dataset into train and test sets
# No random_state is set here, so the split (and the score below) can vary slightly between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [10]:
# Build a Linear Regression Model
linear_model1 = LinearRegression()
linear_model1.fit(X_train, y_train)

LinearRegression()

In [11]:
y_pred = linear_model1.predict(X_test)
print(f"r squared is: {round(r2_score(y_test, y_pred), 2)}")

r squared is: 0.64
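
The statsmodels summary reports an in-sample R-squared, while the scikit-learn figure is computed on a held-out split, so the two need not agree exactly. For a single predictor with an intercept, the in-sample R-squared equals the squared Pearson correlation between the predictor and the target. The sketch below illustrates this on synthetic data; the arrays `x` and `y` are made up for the example and are not drawn from the energy dataset.

```python
import numpy as np

# Synthetic stand-ins for one predictor and one target (not the real T2/T6 data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Least-squares line with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R-squared = 1 - RSS / TSS, computed on the same data used for fitting
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Squared Pearson correlation between predictor and target
corr = np.corrcoef(x, y)[0, 1]

print(f"R-squared: {r2:.6f}, squared correlation: {corr ** 2:.6f}")
```

The identity holds only in-sample; on a test split, `r2_score` can differ from the squared correlation and can even be negative.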


## Question 13

In [12]:
df1 = df.drop(["date"], axis=1)

In [13]:
# Normalize the dataset with MinMaxScaler
scaler = MinMaxScaler()
normalised_df = pd.DataFrame(scaler.fit_transform(df1), columns=df1.columns)
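
For reference, min-max scaling maps each column to the range [0, 1] via `(x - min) / (max - min)`. The following is a minimal NumPy sketch of that formula, not the scikit-learn implementation; the helper name `min_max_scale` and the toy values are assumptions made for illustration.

```python
import numpy as np

def min_max_scale(values):
    """Scale each column to [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Constant columns would divide by zero; map them to 0 instead
    col_range[col_range == 0] = 1.0
    return (values - col_min) / col_range

# Toy two-column sample (illustrative values, not taken from the dataset)
sample = np.array([[60.0, 19.89],
                   [50.0, 19.20],
                   [100.0, 21.00]])
scaled = min_max_scale(sample)
print(scaled)
```

Because the scaler in this notebook is fitted on the full dataframe before the train/test split, the minimum and maximum of the test rows also influence the scaling of the training rows.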

In [14]:
X = normalised_df.drop(["lights", "Appliances"], axis=1)
y = normalised_df["Appliances"]

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.3, random_state=42)

In [16]:
linear_model2 = LinearRegression()
linear_model2.fit(X_train, y_train)

LinearRegression()

In [17]:
y_pred = linear_model2.predict(X_test)
print(f"mean absolute error is: {round(mean_absolute_error(y_test, y_pred), 2)}")

mean absolute error is: 0.05


## Question 14

In [18]:
print(f"residual sum of squares is : {round(np.sum(np.square(y_test - y_pred)), 2)}")

residual sum of squares is : 45.35


## Question 15

In [19]:
print(f"Root mean square error: {round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)}")

Root mean square error: 0.088
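
The residual sum of squares and the root mean square error are linked by RMSE = sqrt(RSS / n), where n is the number of test rows. As a rough consistency check, the sketch below plugs in the numbers printed in this notebook. The test-set size is inferred from the 19,735 total observations and the 13,814 training observations shown in the OLS summaries; that inference is an assumption, since the notebook never prints the test-set size directly.

```python
import math

rss = 45.35        # residual sum of squares reported for Question 14
n_total = 19735    # "No. Observations" in the full-data OLS summary
n_train = 13814    # "No. Observations" in the training-data OLS summary
n_test = n_total - n_train  # inferred size of the 30% test split

# RMSE is the square root of the mean squared residual
rmse = math.sqrt(rss / n_test)
print(n_test, round(rmse, 3))  # 5921 0.088
```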


## Question 16

In [20]:
print(f"coefficient of determination (r squared) is: {round(r2_score(y_test, y_pred), 2)}")

coefficient of determination (r squared) is: 0.15


## Question 17

In [21]:
lasso_model = Lasso(alpha = 0.001)
lasso_model.fit(X_train, y_train)

Lasso(alpha=0.001)

In [22]:
linear_model2.coef_

array([-3.28105119e-03,  5.53600943e-01, -2.36361693e-01, -4.56946743e-01,
        2.90752354e-01,  9.60962589e-02,  2.89831935e-02,  2.63980162e-02,
       -1.56478930e-02,  1.60304394e-02,  2.36484798e-01,  3.80719509e-02,
        1.03030569e-02, -4.45570983e-02,  1.02021226e-01, -1.57630492e-01,
       -1.89934576e-01, -3.98188192e-02, -3.21937728e-01,  6.87498350e-03,
       -7.76936097e-02,  2.91998623e-02,  1.22882728e-02,  1.17813640e-01,
       -1.19940039e+11,  1.19940039e+11])
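
The raw array is easier to read once each weight is paired with its feature name. The sketch below does this with the values printed above, assuming the column order of `X` follows the dataframe (`T1`, `RH_1`, ..., `Tdewpoint`, `rv1`, `rv2`). The last two weights are enormous and cancel each other at the printed precision. The five rows shown by `df.head()` have identical `rv1` and `rv2` values, so if that holds for the whole column (an assumption here), the two features are perfectly collinear and their individual weights are not meaningful. The filter that sets them aside is an illustrative choice, not part of the original solution.

```python
# Assumed feature order after dropping "date", "Appliances", and "lights"
feature_names = [name for i in range(1, 10) for name in (f"T{i}", f"RH_{i}")] + [
    "T_out", "Press_mm_hg", "RH_out", "Windspeed",
    "Visibility", "Tdewpoint", "rv1", "rv2",
]

# Weights copied from the linear_model2.coef_ output
coefficients = [
    -3.28105119e-03, 5.53600943e-01, -2.36361693e-01, -4.56946743e-01,
    2.90752354e-01, 9.60962589e-02, 2.89831935e-02, 2.63980162e-02,
    -1.56478930e-02, 1.60304394e-02, 2.36484798e-01, 3.80719509e-02,
    1.03030569e-02, -4.45570983e-02, 1.02021226e-01, -1.57630492e-01,
    -1.89934576e-01, -3.98188192e-02, -3.21937728e-01, 6.87498350e-03,
    -7.76936097e-02, 2.91998623e-02, 1.22882728e-02, 1.17813640e-01,
    -1.19940039e+11, 1.19940039e+11,
]

weights = dict(zip(feature_names, coefficients))

# Illustrative filter: set aside the two random variables before ranking
stable = {name: w for name, w in weights.items() if name not in ("rv1", "rv2")}
lowest = min(stable, key=stable.get)
highest = max(stable, key=stable.get)
print(f"lowest: {lowest} ({stable[lowest]:.4f}), highest: {highest} ({stable[highest]:.4f})")
```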

In [23]:
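# sm.OLS does not add an intercept unless a constant column is supplied,
# so the summary below reports the uncentered R-squared
ols_model = sm.OLS(y_train, X_train).fit()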
ols_model.summary()

Dep. Variable:,Appliances,R-squared (uncentered):,0.503
Model:,OLS,Adj. R-squared (uncentered):,0.502
Method:,Least Squares,F-statistic:,558.4
Date:,"Tue, 09 Aug 2022",Prob (F-statistic):,0.0
Time:,15:30:46,Log-Likelihood:,13799.0
No. Observations:,13814,AIC:,-27550.0
Df Residuals:,13789,BIC:,-27360.0
Df Model:,25,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
T1,-0.0152,0.020,-0.767,0.443,-0.054,0.024
RH_1,0.5385,0.028,19.321,0.000,0.484,0.593
T2,-0.2069,0.025,-8.228,0.000,-0.256,-0.158
RH_2,-0.3992,0.030,-13.448,0.000,-0.457,-0.341
T3,0.2894,0.014,19.980,0.000,0.261,0.318
RH_3,0.1004,0.016,6.092,0.000,0.068,0.133
T4,0.0328,0.012,2.643,0.008,0.008,0.057
RH_4,0.0187,0.017,1.120,0.263,-0.014,0.051
T5,-0.0301,0.014,-2.190,0.029,-0.057,-0.003

Omnibus:,9902.115,Durbin-Watson:,2.021
Prob(Omnibus):,0.0,Jarque-Bera (JB):,154535.153
Skew:,3.367,Prob(JB):,0.0
Kurtosis:,17.938,Cond. No.,9340000000000000.0


## Question 18

In [24]:
ridge_model = Ridge(alpha=0.4)
ridge_model.fit(X_train, y_train)

Ridge(alpha=0.4)

In [25]:
y_pred = ridge_model.predict(X_test)
print(f"Root mean square error: {round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)}")

Root mean square error: 0.088
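
To see what the `alpha` penalty does, it can help to write ridge regression in closed form: after centering, the weights solve `(XᵀX + alpha·I) w = Xᵀy`, and the intercept is recovered from the column means. The sketch below is a plain NumPy illustration under those assumptions, not the solver scikit-learn uses internally; the helper `ridge_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering removes the intercept
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)  # solve (X'X + alpha*I) w = X'y
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data in [0, 1], loosely mimicking min-max scaled features
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(100, 3))
y_demo = X_demo @ np.array([0.5, -0.3, 0.0]) + 0.05 * rng.normal(size=100)

coef_ols, _ = ridge_fit(X_demo, y_demo, alpha=0.0)    # alpha=0 is ordinary least squares
coef_ridge, _ = ridge_fit(X_demo, y_demo, alpha=0.4)  # same alpha as in this notebook

print("OLS weights:  ", np.round(coef_ols, 4))
print("Ridge weights:", np.round(coef_ridge, 4))
```

The ridge weights have a smaller Euclidean norm than the least-squares weights. With a penalty as mild as 0.4 on several thousand rows the shrinkage is small, which is consistent with the ridge RMSE above matching the linear regression RMSE to three decimals.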


## Question 19

In [26]:
lasso_model = Lasso(alpha = 0.001)
lasso_model.fit(X_train, y_train)

Lasso(alpha=0.001)

In [27]:
lasso_model.coef_

array([ 0.        ,  0.01787993,  0.        , -0.        ,  0.        ,
        0.        , -0.        ,  0.        , -0.        ,  0.        ,
        0.        , -0.        , -0.        , -0.        ,  0.        ,
       -0.00011004, -0.        , -0.        ,  0.        , -0.        ,
       -0.04955749,  0.00291176,  0.        ,  0.        , -0.        ,
       -0.        ])
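
Counting the surviving weights by eye is error-prone, so the sketch below pairs the printed Lasso coefficients with feature names and lists the non-zero ones. As before, the feature order is assumed to follow the dataframe columns after dropping `date`, `Appliances`, and `lights`.

```python
# Assumed feature order of X
feature_names = [name for i in range(1, 10) for name in (f"T{i}", f"RH_{i}")] + [
    "T_out", "Press_mm_hg", "RH_out", "Windspeed",
    "Visibility", "Tdewpoint", "rv1", "rv2",
]

# Weights copied from the lasso_model.coef_ output
lasso_coefficients = [
    0.0, 0.01787993, 0.0, -0.0, 0.0,
    0.0, -0.0, 0.0, -0.0, 0.0,
    0.0, -0.0, -0.0, -0.0, 0.0,
    -0.00011004, -0.0, -0.0, 0.0, -0.0,
    -0.04955749, 0.00291176, 0.0, 0.0, -0.0,
    -0.0,
]

# Keep only the features whose weight was not shrunk to zero
non_zero = {
    name: w for name, w in zip(feature_names, lasso_coefficients) if w != 0
}
print(len(non_zero), non_zero)
```

With the values printed above, four features keep a non-zero weight: `RH_1`, `RH_8`, `RH_out`, and `Windspeed`.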

## Question 20

In [28]:
y_pred = lasso_model.predict(X_test)
print(f"Root mean square error: {round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)}")

Root mean square error: 0.094
