## Appliances Energy Prediction

In [91]:
# Import Libraries

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [52]:
# Load Appliances Energy Prediction Data

energy_df = pd.read_csv('Dataset/energydata_complete.csv', parse_dates=['date'])
energy_df.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,55.2,7.026667,84.256667,17.2,41.626667,18.2,48.9,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,55.2,6.833333,84.063333,17.2,41.56,18.2,48.863333,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,55.09,6.56,83.156667,17.2,41.433333,18.2,48.73,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,55.09,6.433333,83.423333,17.133333,41.29,18.1,48.59,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,55.09,6.366667,84.893333,17.2,41.23,18.1,48.59,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


## Dataset Information

https://archive.ics.uci.edu/ml/machine-learning-databases/00374

This is the Appliances Energy Prediction dataset. It contains readings at 10-minute intervals over about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions about every 3.3 minutes, and the wireless data were then averaged over 10-minute periods. The energy data were logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) were downloaded from a public dataset from Reliable Prognosis (rp5.ru) and merged with the experimental data using the date and time column.

Two random variables are included in the dataset for testing the regression models and for filtering out non-predictive attributes (parameters). The attribute information is listed below.
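
The averaging and merging steps described above can be sketched roughly as follows. This is only an illustration on made-up readings, not the dataset authors' pipeline: the frames `raw` and `weather`, their values, and the exact 200-second spacing are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sensor readings, one every 200 s (about 3.3 min) for one hour
raw_index = pd.date_range("2016-01-11 17:00:00", periods=18,
                          freq=pd.Timedelta(seconds=200))
raw = pd.DataFrame({"T1": np.linspace(19.0, 20.7, 18)}, index=raw_index)

# Average the readings that fall inside each 10-minute window
sensor_10min = raw.resample("10min").mean().rename_axis("date").reset_index()

# Hypothetical weather-station table that is already on a 10-minute grid
weather = pd.DataFrame({
    "date": pd.date_range("2016-01-11 17:00:00", periods=6, freq="10min"),
    "T_out": [6.6, 6.5, 6.4, 6.3, 6.1, 6.0],
})

# Join the two sources on the shared timestamp column
merged = sensor_10min.merge(weather, on="date", how="inner")
print(merged)
```

Each 10-minute window here holds three raw readings, so the merged frame has one averaged row per window alongside the matching weather value.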

<table>
<thead><tr>
<th><strong>Feature Name</strong></th>
<th><strong>Description</strong></th>
<th><strong>Data Type</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>date</td>
<td>year-month-day hour:minute:second</td>
<td>datetime64[ns]</td>
</tr>
<tr>
<td>Appliances</td>
<td>energy use in Wh</td>
<td>int64</td>
</tr>
<tr>
<td>lights</td>
<td>energy use of light fixtures in the house in Wh</td>
<td>int64</td>
</tr>
<tr>
<td>T1</td>
<td>Temperature in kitchen area, in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>RH_1</td>
<td>Humidity in kitchen area, in %</td>
<td>float64</td>
</tr>
<tr>
<td>T2</td>
<td>Temperature in living room area, in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>RH_2</td>
<td>Humidity in living room area, in %</td>
<td>float64</td>
</tr>
<tr>
<td>T3</td>
<td>Temperature in laundry room area, in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>RH_3</td>
<td>Humidity in laundry room area, in %</td>
<td>float64</td>
</tr>
<tr>
<td>T4</td>
<td>Temperature in office room, in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>RH_4</td>
<td>Humidity in office room, in %</td>
<td>float64</td>
</tr>
<tr>
<td>T5</td>
<td>Temperature in bathroom, in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>RH_5</td>
<td>Humidity in bathroom, in %</td>
<td>float64</td>
</tr>
<tr>
<td>T6</td>
<td>Temperature outside the building (north side), in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>RH_6</td>
<td>Humidity outside the building (north side), in %</td>
<td>float64</td>
</tr>
<tr>
<td>T7</td>
<td>Temperature in ironing room, in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>RH_7</td>
<td>Humidity in ironing room, in %</td>
<td>float64</td>
</tr>
<tr>
<td>T8</td>
<td>Temperature in teenager room 2, in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>RH_8</td>
<td>Humidity in teenager room 2, in %</td>
<td>float64</td>
</tr>
<tr>
<td>T9</td>
<td>Temperature in parents room, in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>RH_9</td>
<td>Humidity in parents room, in %</td>
<td>float64</td>
</tr>
<tr>
<td>T_out</td>
<td>Temperature outside (from Chievres weather station), in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>Press_mm_hg</td>
<td>Pressure (from Chievres weather station), in mm Hg</td>
<td>float64</td>
</tr>
<tr>
<td>RH_out</td>
<td>Humidity outside (from Chievres weather station), in %</td>
<td>float64</td>
</tr>
<tr>
<td>Windspeed</td>
<td>Wind speed (from Chievres weather station), in m/s</td>
<td>float64</td>
</tr>
<tr>
<td>Visibility</td>
<td>Visibility (from Chievres weather station), in km</td>
<td>float64</td>
</tr>
<tr>
<td>Tdewpoint</td>
<td>Dew point temperature (from Chievres weather station), in Celsius</td>
<td>float64</td>
</tr>
<tr>
<td>rv1</td>
<td>Random variable 1, nondimensional</td>
<td>float64</td>
</tr>
<tr>
<td>rv2</td>
<td>Random variable 2, nondimensional</td>
<td>float64</td>
</tr>
</tbody>
</table>

## Question 17

From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the Root Mean Squared Error to three decimal places?

In [55]:
df_new = energy_df.drop(columns=['date'])

In [60]:
X = df_new['T2'].values.reshape(-1,1) # reshape to 2D array
y = df_new['T6']

In [61]:
linear_model = LinearRegression()
linear_model.fit(X, y)

# Making predictions
y_pred = linear_model.predict(X)

In [62]:
# Calculating RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
round(rmse, 3)

3.644
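
As a cross-check on what the metric measures, RMSE is the square root of the mean of the squared residuals, sqrt(mean((y - y_hat)^2)). The sketch below computes it by hand and compares it with the `mean_squared_error` route used above. The data are synthetic stand-ins (the names `x_demo`, `y_demo` and the generating line are assumptions), so the printed number is unrelated to the 3.644 reported for T2 and T6.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic single-feature data standing in for T2 (x) and T6 (y)
rng = np.random.default_rng(0)
x_demo = rng.uniform(16, 30, size=200)
y_demo = 2.0 * x_demo - 30 + rng.normal(0, 3, size=200)

# scikit-learn expects a 2D feature array, hence the reshape
X_demo = x_demo.reshape(-1, 1)
demo_model = LinearRegression().fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

# RMSE computed directly from its definition
rmse_manual = np.sqrt(np.mean((y_demo - y_hat) ** 2))

# RMSE computed the same way as in the notebook cell
rmse_sklearn = np.sqrt(mean_squared_error(y_demo, y_hat))

print(round(rmse_manual, 3), round(rmse_sklearn, 3))
```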

## Question 18

Remove the following columns: ["date", "lights"]. The target variable is `Appliances`. Use a 70-30 train-test split with a random state of 42 (for reproducibility). Normalize the dataset using the MinMaxScaler (hint: use the MinMaxScaler `fit_transform` and `transform` methods on the train and test sets respectively). Run a multiple linear regression using the training set. Answer the following questions:

What is the Mean Absolute Error (to three decimal places) for the training set?

In [63]:
df_drop = energy_df.drop(columns=['date', 'lights'])
df_drop.head()

Unnamed: 0,Appliances,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,60,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,55.2,7.026667,84.256667,17.2,41.626667,18.2,48.9,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,60,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,55.2,6.833333,84.063333,17.2,41.56,18.2,48.863333,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,50,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,55.09,6.56,83.156667,17.2,41.433333,18.2,48.73,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,50,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,55.09,6.433333,83.423333,17.133333,41.29,18.1,48.59,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,60,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,55.09,6.366667,84.893333,17.2,41.23,18.1,48.59,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [64]:
X = df_drop.drop(columns=['Appliances'])
y = df_drop['Appliances']

In [65]:
# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [68]:
# Scale the features using MinMaxScaler
scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
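
The reason for calling `fit_transform` on the training set but only `transform` on the test set is that the scaler should learn its minimum and maximum from the training data alone. A minimal sketch with made-up one-column arrays (`x_train_demo` and `x_test_demo` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: training values span 10..30, test data contains a value outside that range
x_train_demo = np.array([[10.0], [20.0], [30.0]])
x_test_demo = np.array([[15.0], [35.0]])

scaler_demo = MinMaxScaler()

# fit_transform learns min=10 and max=30 from the training data, then scales it
train_scaled = scaler_demo.fit_transform(x_train_demo)

# transform reuses the training min/max, so test values may fall outside [0, 1]
test_scaled = scaler_demo.transform(x_test_demo)

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.25 1.25]
```

The test value 35 maps to 1.25 because it exceeds the training maximum. That is expected: refitting the scaler on the test set would hide this and leak test-set information into preprocessing.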

In [69]:
# Creating and fitting the model
model = LinearRegression()
model.fit(x_train_scaled, y_train)

# Making predictions on the training set
y_train_pred = model.predict(x_train_scaled)

In [70]:
# Calculating MAE for the training set
mae_train = mean_absolute_error(y_train, y_train_pred)
round(mae_train, 3)

53.742

## Question 19

What is the Root Mean Squared Error (in three decimal places) for the training set?

In [71]:
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
round(rmse_train, 3)

95.216

## Question 20

What is the Mean Absolute Error (in three decimal places) for the test set?

In [75]:
# Obtain predictions on the test set
predicted_values_scaled = model.predict(x_test_scaled)

In [76]:
mae = mean_absolute_error(y_test, predicted_values_scaled)
round(mae, 3)

53.643

## Question 21

What is the Root Mean Squared Error (in three decimal places) for the test set?

In [77]:
rmse = np.sqrt(mean_squared_error(y_test, predicted_values_scaled))
round(rmse, 3)

93.641

## Question 22

Did the model above overfit the training set?

- RMSE for training set: 95.216
- RMSE for test set: 93.641

No. The test RMSE (93.641) is slightly lower than the training RMSE (95.216), so there is no sign of overfitting: an overfit model would show a noticeably higher error on the test set than on the training set. The model generalizes to unseen data about as well as it fits the training data.
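
The comparison can also be expressed as a relative gap between the two errors. The sketch below reuses the two RMSE values reported above; the 10% threshold is an arbitrary rule of thumb chosen for illustration, not a standard cutoff.

```python
# RMSE values reported in the cells above
rmse_train = 95.216
rmse_test = 93.641

# Relative gap: positive means the test error is higher than the training error
relative_gap = (rmse_test - rmse_train) / rmse_train
print(f"relative gap: {relative_gap:.2%}")

# Hypothetical rule of thumb: suspect overfitting only when the test error
# exceeds the training error by more than 10%
possible_overfit = relative_gap > 0.10
print(possible_overfit)
```

A negative gap, as here, means the test error is the smaller of the two, which is the opposite of the overfitting pattern.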

## Question 23

Train a ridge regression model with default parameters. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?

In [84]:
ridge_reg = Ridge()
ridge_reg.fit(x_train_scaled, y_train)

In [85]:
# Make predictions on the test set
y_test_pred_ridge = ridge_reg.predict(x_test_scaled)

In [86]:
# Calculate RMSE for the test set
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_test_pred_ridge))
round(rmse_ridge, 3)

93.709

There is a slight change: the test RMSE rises from 93.641 (linear regression) to 93.709 (ridge regression). Tuning the regularization strength `alpha` may improve the ridge model.
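
One way such tuning could look is a cross-validated grid search over `alpha`. This is a sketch on synthetic data from `make_regression`, not a result for the energy dataset; the alpha grid, the fold count, and all `*_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression problem standing in for the energy data
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=20.0, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

# Same preprocessing pattern as the notebook: fit on train, transform test
scaler_demo = MinMaxScaler()
x_tr_s = scaler_demo.fit_transform(x_tr)
x_te_s = scaler_demo.transform(x_te)

# Hypothetical grid of regularization strengths, scored by (negated) RMSE
alpha_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), alpha_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(x_tr_s, y_tr)

# GridSearchCV refits the best model on the full training set by default
best_alpha = search.best_params_["alpha"]
rmse_tuned = np.sqrt(mean_squared_error(y_te, search.predict(x_te_s)))
print(best_alpha, round(rmse_tuned, 3))
```

Selecting `alpha` by cross-validation on the training set keeps the test set untouched until the final evaluation.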

## Question 24

Train a lasso regression model with default parameters and obtain the new feature weights with it. How many of the features have non-zero feature weights?

In [95]:
lasso_reg = Lasso()
lasso_reg.fit(x_train_scaled, y_train)

In [92]:
# comparing the effects of regularisation

def get_weights_df(model, feat, col_name):
    # Return a two-column DataFrame pairing each feature name with its fitted
    # coefficient, sorted from the most negative weight to the most positive.
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    weights_df = pd.DataFrame(weights).reset_index()
    weights_df.columns = ['Features', col_name]
    return weights_df

linear_model_weights = get_weights_df(model, x_train, 'Linear_Model_Weight')
ridge_weights_df = get_weights_df(ridge_reg, x_train, 'Ridge_Weight')
lasso_weights_df = get_weights_df(lasso_reg, x_train, 'Lasso_weight')

final_weights = pd.merge(linear_model_weights, ridge_weights_df, on='Features')
final_weights = pd.merge(final_weights, lasso_weights_df, on='Features')
final_weights

Unnamed: 0,Features,Linear_Model_Weight,Ridge_Weight,Lasso_weight
0,rv1,-8125910000000.0,0.775954,-0.0
1,RH_2,-469.5204,-380.264856,-0.0
2,T_out,-344.3917,-220.546136,0.0
3,T2,-252.711,-181.11524,0.0
4,T9,-203.2405,-198.646042,-0.0
5,RH_8,-168.6277,-166.246807,-26.102888
6,RH_out,-83.10943,-36.798983,-50.293976
7,RH_7,-47.56344,-50.215392,-0.0
8,RH_9,-42.58525,-45.543329,-0.0
9,T5,-16.74855,-25.703802,-0.0


4 features have non-zero feature weights for lasso regression (the table above shows only the first 10 of the 26 features).
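
Rather than reading the count off a printed table, the non-zero weights can be counted directly from `coef_`. The sketch below uses synthetic data and hypothetical column names (`f0` to `f7`), so its count is unrelated to the figure stated above; in the notebook the same two lines would be applied to `lasso_reg.coef_` and `x_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data with 8 features, only 3 of which carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)
cols = [f"f{i}" for i in range(8)]  # hypothetical feature names

# Lasso with its default regularization strength (alpha=1.0)
lasso_demo = Lasso().fit(X_demo, y_demo)

# Pair each coefficient with its feature name and keep the non-zero ones
weights = pd.Series(lasso_demo.coef_, index=cols)
nonzero = weights[weights != 0]

# Count of features that survive the L1 penalty
n_nonzero = int(np.count_nonzero(lasso_demo.coef_))
print(n_nonzero)
print(nonzero)
```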

## Question 25

What is the new RMSE with the Lasso Regression on the test set?

In [96]:
y_test_pred_lasso = lasso_reg.predict(x_test_scaled)

In [97]:
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_test_pred_lasso))
round(rmse_lasso, 3)

99.424