# Introduction
The Appliances Energy Prediction dataset records the energy consumption of appliances in a single house over a span of 4.5 months. It combines temperature and humidity readings from different rooms, gathered through a ZigBee wireless sensor network, with energy-meter data and external weather data. This analysis uses the dataset to identify the factors that influence appliance energy consumption and to predict that consumption, with the aim of improving energy efficiency and informing energy management strategies in residential settings.

# Feature Description
date, timestamp in year-month-day hour:minute:second format

Appliances, energy use in Wh

lights, energy use of light fixtures in the house in Wh

T1, Temperature in kitchen area, in Celsius

RH_1, Humidity in kitchen area, in %

T2, Temperature in living room area, in Celsius

RH_2, Humidity in living room area, in %

T3, Temperature in laundry room area, in Celsius

RH_3, Humidity in laundry room area, in %

T4, Temperature in office room, in Celsius

RH_4, Humidity in office room, in %

T5, Temperature in bathroom, in Celsius

RH_5, Humidity in bathroom, in %

T6, Temperature outside the building (north side), in Celsius

RH_6, Humidity outside the building (north side), in %

T7, Temperature in ironing room, in Celsius

RH_7, Humidity in ironing room, in %

T8, Temperature in teenager room 2, in Celsius

RH_8, Humidity in teenager room 2, in %

T9, Temperature in parents room, in Celsius

RH_9, Humidity in parents room, in %

T_out, Temperature outside (from Chievres weather station), in Celsius

Pressure (from Chievres weather station), in mm Hg

RH_out, Humidity outside (from Chievres weather station), in %

Wind speed (from Chievres weather station), in m/s

Visibility (from Chievres weather station), in km

Tdewpoint (from Chievres weather station), in °C

rv1, Random variable 1, nondimensional

rv2, Random variable 2, nondimensional

# Importations

In [101]:
# Data handling
import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler  # used below to normalize the data
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score



# Data Loading

In [102]:
df = pd.read_csv("energydata_complete.csv", parse_dates=["date"])

In [103]:
df.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [104]:
#dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         19735 non-null  datetime64[ns]
 1   Appliances   19735 non-null  int64         
 2   lights       19735 non-null  int64         
 3   T1           19735 non-null  float64       
 4   RH_1         19735 non-null  float64       
 5   T2           19735 non-null  float64       
 6   RH_2         19735 non-null  float64       
 7   T3           19735 non-null  float64       
 8   RH_3         19735 non-null  float64       
 9   T4           19735 non-null  float64       
 10  RH_4         19735 non-null  float64       
 11  T5           19735 non-null  float64       
 12  RH_5         19735 non-null  float64       
 13  T6           19735 non-null  float64       
 14  RH_6         19735 non-null  float64       
 15  T7           19735 non-null  float64       
 16  RH_7

In [105]:
#Checking for missing values
df.isnull().sum()

date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

In [106]:
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Appliances | 19735.0 | 97.694958 | 102.524891 | 10.0 | 50.0 | 60.0 | 100.0 | 1080.0 |
| lights | 19735.0 | 3.801875 | 7.935988 | 0.0 | 0.0 | 0.0 | 0.0 | 70.0 |
| T1 | 19735.0 | 21.686571 | 1.606066 | 16.79 | 20.76 | 21.6 | 22.6 | 26.26 |
| RH_1 | 19735.0 | 40.259739 | 3.979299 | 27.023333 | 37.333333 | 39.656667 | 43.066667 | 63.36 |
| T2 | 19735.0 | 20.341219 | 2.192974 | 16.1 | 18.79 | 20.0 | 21.5 | 29.856667 |
| RH_2 | 19735.0 | 40.42042 | 4.069813 | 20.463333 | 37.9 | 40.5 | 43.26 | 56.026667 |
| T3 | 19735.0 | 22.267611 | 2.006111 | 17.2 | 20.79 | 22.1 | 23.29 | 29.236 |
| RH_3 | 19735.0 | 39.2425 | 3.254576 | 28.766667 | 36.9 | 38.53 | 41.76 | 50.163333 |
| T4 | 19735.0 | 20.855335 | 2.042884 | 15.1 | 19.53 | 20.666667 | 22.1 | 26.2 |
| RH_4 | 19735.0 | 39.026904 | 4.341321 | 27.66 | 35.53 | 38.4 | 42.156667 | 51.09 |


# Exploratory Data Analysis

## Questions

Question 1 : Normalize the dataset using the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set. Answer the following questions:

What is the Mean Absolute Error (in two decimal places)?

In [107]:
# Removing the date and lights columns
columns_to_remove = ["date", "lights"]
df = df.drop(columns=columns_to_remove)

In [108]:
# Normalizing the dataset using MinMaxScaler
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_normalized.head()

Unnamed: 0,Appliances,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,0.046729,0.32735,0.566187,0.225345,0.684038,0.215188,0.746066,0.351351,0.764262,0.175506,...,0.223032,0.67729,0.37299,0.097674,0.894737,0.5,0.953846,0.538462,0.265449,0.265449
1,0.046729,0.32735,0.541326,0.225345,0.68214,0.215188,0.748871,0.351351,0.782437,0.175506,...,0.2265,0.678532,0.369239,0.1,0.894737,0.47619,0.894872,0.533937,0.372083,0.372083
2,0.037383,0.32735,0.530502,0.225345,0.679445,0.215188,0.755569,0.344745,0.778062,0.175506,...,0.219563,0.676049,0.365488,0.102326,0.894737,0.452381,0.835897,0.529412,0.572848,0.572848
3,0.037383,0.32735,0.52408,0.225345,0.678414,0.215188,0.758685,0.341441,0.770949,0.175506,...,0.219563,0.671909,0.361736,0.104651,0.894737,0.428571,0.776923,0.524887,0.908261,0.908261
4,0.046729,0.32735,0.531419,0.225345,0.676727,0.215188,0.758685,0.341441,0.762697,0.178691,...,0.219563,0.671909,0.357985,0.106977,0.894737,0.404762,0.717949,0.520362,0.201611,0.201611
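
As a quick check on what `MinMaxScaler` computes, the sketch below applies the min-max formula by hand to the first five `Appliances` readings from `df.head()`, using the column minimum (10) and maximum (1080) reported by `df.describe()`. The helper name `min_max_scale` is illustrative and is not part of scikit-learn.

```python
import numpy as np

def min_max_scale(values, v_min, v_max):
    """Rescale values linearly so that v_min maps to 0 and v_max maps to 1."""
    values = np.asarray(values, dtype=float)
    return (values - v_min) / (v_max - v_min)

# First five Appliances readings (Wh) from df.head()
appliances_wh = [60, 60, 50, 50, 60]

# Column minimum and maximum taken from the df.describe() output
scaled = min_max_scale(appliances_wh, v_min=10, v_max=1080)
print(np.round(scaled, 6))
```

The results (0.046729 and 0.037383) agree with the `Appliances` column in the normalized rows shown above.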


In [109]:

# Split the dataset into features (X) and target variable (y)
X = df_normalized.drop(columns=["Appliances"])
y = df_normalized["Appliances"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [110]:
X_train.shape,y_train.shape

((13814, 26), (13814,))

In [111]:
X_test.shape,y_test.shape

((5921, 26), (5921,))
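
The shapes above follow from simple split arithmetic. As a minimal sketch, assuming the test row count is the fractional `test_size` rounded up to a whole row and the training set receives the remainder, the counts can be reproduced by hand:

```python
import math

n_samples = 19735   # rows reported by df.info()
test_size = 0.3     # fraction passed to train_test_split

# Assumed rounding rule: test rows are rounded up, training rows get the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```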

In [112]:
X_train.head()

Unnamed: 0,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
9129,0.49736,0.236767,0.12285,0.565939,0.373878,0.303474,0.476577,0.26476,0.408027,0.159533,...,0.475893,0.37638,0.16881,0.862791,0.776316,0.142857,0.984615,0.192308,0.724554,0.724554
2453,0.286167,0.482616,0.188999,0.669978,0.217957,0.735317,0.27027,0.691421,0.178691,0.333576,...,0.240375,0.703504,0.262594,0.836434,0.807018,0.142857,0.6,0.342383,0.864041,0.864041
9152,0.422386,0.230529,0.057427,0.60643,0.373878,0.338059,0.414414,0.236449,0.378404,0.151639,...,0.468262,0.409803,0.110397,0.853488,0.859649,0.095238,0.917949,0.158371,0.499502,0.499502
12694,0.560718,0.44684,0.280834,0.704002,0.51429,0.515189,0.540541,0.486556,0.509317,0.424604,...,0.561915,0.340784,0.444802,0.55969,0.75,0.119048,0.384615,0.558069,0.323173,0.323173
16952,0.835269,0.422071,1.0,0.318493,0.745383,0.459106,0.900901,0.516432,0.748845,0.455819,...,0.854318,0.633278,0.849946,0.530233,0.355263,0.142857,0.6,0.78733,0.34106,0.34106


In [113]:
y_test.head()

8980     0.028037
2754     0.074766
9132     0.037383
14359    0.037383
8875     0.056075
Name: Appliances, dtype: float64

In [115]:
y_train.head()

9129     0.037383
2453     0.018692
9152     0.028037
12694    0.102804
16952    0.037383
Name: Appliances, dtype: float64

In [116]:
# Fitting a linear regression model on the training set

# Instantiating the LinearRegression model
model = LinearRegression()

#Fitting
model.fit(X_train, y_train)

In [117]:
# Making predictions on the test set
y_pred = model.predict(X_test)

In [118]:
# Calculating the Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Print the Mean Absolute error rounded to 2 decimal places
print("{:.2f}".format(mae))


0.05
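
Because `Appliances` was scaled together with the features, the MAE above is expressed in normalized units rather than Wh. A rough conversion, assuming the scaler used the minimum (10 Wh) and maximum (1080 Wh) shown in `df.describe()`, multiplies the error by the original range. The input here is the rounded value 0.05, so the result is only approximate; the helper `to_wh` is illustrative.

```python
# Range of the original Appliances column, from df.describe()
appliances_min_wh = 10.0
appliances_max_wh = 1080.0

def to_wh(scaled_error):
    """Convert an absolute error on the min-max scale back to Wh."""
    return scaled_error * (appliances_max_wh - appliances_min_wh)

mae_scaled = 0.05  # rounded MAE reported above
print("Approximate MAE: {:.1f} Wh".format(to_wh(mae_scaled)))
```

The same multiplication applies to any error measured in the target's units, such as the RMSE, since the offset of the scaling cancels out in a difference.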


Question 2 : What is the Residual Sum of Squares (in two decimal places)?

In [119]:
# Calculating the residuals
residuals = y_test - y_pred

# Calculating the Residual Sum of Squares (RSS)
rss = np.sum(residuals**2)

# Print the RSS rounded to 2 decimal places
print("{:.2f}".format(rss))


45.35


Question 3: What is the Root Mean Squared Error (in three decimal places)?

In [120]:
from sklearn.metrics import mean_squared_error

# Calculate the Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the RMSE rounded to 3 decimal places
print("{:.3f}".format(rmse))


0.088
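
The RSS and RMSE are two views of the same squared residuals: dividing the RSS by the number of test rows gives the MSE, and its square root is the RMSE. The sketch below checks this against the rounded figures reported above (an RSS of 45.35 over 5921 test rows); the function name is illustrative.

```python
import math

def rmse_from_rss(rss, n_samples):
    """RMSE = sqrt(RSS / n), where n is the number of evaluated samples."""
    return math.sqrt(rss / n_samples)

rss_reported = 45.35   # rounded RSS from Question 2
n_test = 5921          # rows in the test set

print("{:.3f}".format(rmse_from_rss(rss_reported, n_test)))
```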


Question 4 : What is the Coefficient of Determination (in two decimal places)?

In [121]:
from sklearn.metrics import r2_score

# Calculate the coefficient of determination (R-squared)
r2 = r2_score(y_test, y_pred)

# Print the R-squared value rounded to 2 decimal places
print("{:.2f}".format(r2))


0.15
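
The coefficient of determination compares the model's squared error with that of a baseline that always predicts the mean of the observed values: R² = 1 - RSS/TSS. A minimal NumPy sketch on made-up numbers (not the notebook's data) shows the calculation; `r_squared` is an illustrative helper, not a library function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Compute R^2 = 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - rss / tss

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.5]

print(r_squared(y_true, y_pred))         # model predictions
print(r_squared(y_true, [2.5] * 4))      # mean-only baseline
```

Read this way, an R² of 0.15 means the linear model removes about 15% of the squared error of the mean-only baseline on the test set.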


Question 5 : Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?

In [122]:
# Get the feature weights
feature_weights = model.coef_

In [123]:
# Creating a DataFrame to associate feature names with their weights
weights_df = pd.DataFrame({'Feature': X_train.columns, 'Weight': feature_weights})
weights_df

| | Feature | Weight |
|---|---|---|
| 0 | T1 | -0.003281 |
| 1 | RH_1 | 0.553547 |
| 2 | T2 | -0.236178 |
| 3 | RH_2 | -0.456698 |
| 4 | T3 | 0.290627 |
| 5 | RH_3 | 0.096048 |
| 6 | T4 | 0.028981 |
| 7 | RH_4 | 0.026386 |
| 8 | T5 | -0.015657 |
| 9 | RH_5 | 0.016006 |


In [124]:
# Sorting the DataFrame by weights in ascending order
weights_df = weights_df.sort_values('Weight')
weights_df

| | Feature | Weight |
|---|---|---|
| 3 | RH_2 | -0.456698 |
| 18 | T_out | -0.32186 |
| 2 | T2 | -0.236178 |
| 16 | T9 | -0.189941 |
| 15 | RH_8 | -0.157595 |
| 20 | RH_out | -0.077671 |
| 13 | RH_7 | -0.044614 |
| 17 | RH_9 | -0.0398 |
| 8 | T5 | -0.015657 |
| 0 | T1 | -0.003281 |


In [125]:

# Get the feature with the lowest weight
lowest_weight_feature = weights_df['Feature'].iloc[0]

# Get the feature with the highest weight
highest_weight_feature = weights_df['Feature'].iloc[-1]

# Print the features with the lowest and highest weights
print("Feature with the lowest weight:", lowest_weight_feature)
print("Feature with the highest weight:", highest_weight_feature)


Feature with the lowest weight: RH_2
Feature with the highest weight: RH_1
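
Sorting is not strictly needed to find the extremes. As an alternative sketch, `np.argmin` and `np.argmax` return the positions of the smallest and largest weights directly. The arrays below contain only the ten features visible in the unsorted table above, not the full set of 26, so this is an illustration rather than a replacement for the cell above.

```python
import numpy as np

# First ten features and weights as displayed in the unsorted table
features = np.array(["T1", "RH_1", "T2", "RH_2", "T3",
                     "RH_3", "T4", "RH_4", "T5", "RH_5"])
weights = np.array([-0.003281, 0.553547, -0.236178, -0.456698, 0.290627,
                    0.096048, 0.028981, 0.026386, -0.015657, 0.016006])

lowest = features[np.argmin(weights)]    # most negative weight
highest = features[np.argmax(weights)]   # most positive weight
print(lowest, highest)
```

Note that "lowest" here means most negative, not smallest in magnitude. Among the ten features shown, T1 has the weight closest to zero.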


Question 6 : Train a ridge regression model with an alpha value of 0.4. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?

In [126]:
from sklearn.linear_model import Ridge

# Train a Ridge regression model with alpha=0.4
ridge_model = Ridge(alpha=0.4)
ridge_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_ridge = ridge_model.predict(X_test)

# Calculate the RMSE for the Ridge model
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))

# Print the RMSE for the Ridge model rounded to 3 decimal places
print("RMSE (Ridge): {:.3f}".format(rmse_ridge))


RMSE (Ridge): 0.088
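
Ridge regression adds an L2 penalty, alpha times the squared norm of the weights, to the least-squares objective. The sketch below uses the closed-form solution on synthetic data (not the notebook's data) to illustrate why a small alpha such as 0.4 tends to leave the fit nearly unchanged when there are thousands of rows. The helpers `fit_ridge` and `rmse` are illustrative; centering the data so that the intercept is not penalized is an assumption made to mirror a typical ridge setup.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def rmse(X, y, w, b):
    """Root mean squared error of the linear prediction X @ w + b."""
    return np.sqrt(np.mean((y - (X @ w + b)) ** 2))

# Synthetic data: 5000 rows, 3 features in [0, 1], small Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(size=(5000, 3))
y = X @ np.array([0.5, -0.3, 0.1]) + rng.normal(scale=0.05, size=5000)

w_ols, b_ols = fit_ridge(X, y, alpha=0.0)      # alpha = 0 is ordinary least squares
w_ridge, b_ridge = fit_ridge(X, y, alpha=0.4)

print("RMSE (alpha=0.0): {:.3f}".format(rmse(X, y, w_ols, b_ols)))
print("RMSE (alpha=0.4): {:.3f}".format(rmse(X, y, w_ridge, b_ridge)))
```

With this many rows, the `Xc.T @ Xc` term is far larger than `alpha * I`, so the penalty shrinks the weights only slightly. That is consistent with the ridge RMSE above matching the linear regression RMSE to three decimal places.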


Question 7 : Train a lasso regression model with an alpha value of 0.001 and obtain the new feature weights with it. How many of the features have non-zero feature weights?

In [127]:
from sklearn.linear_model import Lasso

# Train a Lasso regression model with alpha=0.001
lasso_model = Lasso(alpha=0.001)
lasso_model.fit(X_train, y_train)

In [128]:
# Get the feature weights from the Lasso model
lasso_feature_weights = lasso_model.coef_


In [129]:
# Count the number of features with non-zero weights
num_nonzero_features = np.count_nonzero(lasso_feature_weights)

# Print the number of features with non-zero weights
print("Number of features with non-zero weights:", num_nonzero_features)


Number of features with non-zero weights: 4
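
The L1 penalty is what lets Lasso set weights exactly to zero. One way to see this is the soft-thresholding operator, which appears as the per-coefficient update in coordinate-descent Lasso solvers: values whose magnitude falls below the threshold become zero, and the rest shrink toward zero. The sketch below is illustrative only, with made-up inputs; it is not scikit-learn's implementation, and `soft_threshold` is a hypothetical helper name.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink z toward zero by threshold, setting small values to exactly zero."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Made-up candidate weights; two of them are smaller in magnitude than the threshold
candidate_weights = np.array([0.5, -0.0005, 0.002, -0.3, 0.0008])
shrunk = soft_threshold(candidate_weights, threshold=0.001)

print(shrunk)
print("Non-zero weights:", np.count_nonzero(shrunk))
```

Ridge's L2 penalty, by contrast, scales weights down without zeroing them, which is why counting non-zero weights is informative for Lasso in particular.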


Question 8 : What is the new RMSE with the lasso regression? (Answer should be in three (3) decimal places)

In [130]:
from sklearn.metrics import mean_squared_error

# Making predictions on the test set using the Lasso model
y_pred_lasso = lasso_model.predict(X_test)

In [131]:
# Calculate the RMSE for the Lasso model
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))

# Print the RMSE for the Lasso model rounded to 3 decimal places
print("RMSE (Lasso): {:.3f}".format(rmse_lasso))


RMSE (Lasso): 0.094
