<a href="https://colab.research.google.com/github/RadarGem/Hamoye-Tag-Along-Codes/blob/main/Hamoye_Tag_Along_code_Stage_Two.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Appliances Energy Prediction Dataset

The dataset for the remainder of this quiz is the Appliances Energy Prediction data. The data were recorded at 10-minute intervals for about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions about every 3.3 minutes, and the wireless data were then averaged over 10-minute periods. The energy data were logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) were downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters). The attribute information is listed below.
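
As a rough illustration of the averaging step described above, the sketch below resamples readings taken every 200 seconds (about 3.3 minutes) into 10-minute means with pandas. The readings are made up for the example and the variable names `raw` and `ten_min` are hypothetical; this is not the procedure the dataset authors used, only one way the idea could be expressed.

```python
import pandas as pd

# Hypothetical raw readings from one sensor node, one every 200 s (about 3.3 min).
raw = pd.Series(
    [19.0, 19.2, 19.4, 20.0, 20.2, 20.4],
    index=pd.date_range("2016-01-11 17:00:00", periods=6, freq=pd.Timedelta(seconds=200)),
    name="T1",
)

# Average all readings that fall inside each 10-minute window.
ten_min = raw.resample("10min").mean()
print(ten_min)
```

Each 10-minute window here contains three raw readings, so every output row is the mean of three values.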

Attribute Information:

date, time in year-month-day hour:minute:second format

Appliances, energy use in Wh

lights, energy use of light fixtures in the house in Wh

T1, Temperature in kitchen area, in Celsius

RH_1, Humidity in kitchen area, in %

T2, Temperature in living room area, in Celsius

RH_2, Humidity in living room area, in %

T3, Temperature in laundry room area, in Celsius

RH_3, Humidity in laundry room area, in %

T4, Temperature in office room, in Celsius

RH_4, Humidity in office room, in %

T5, Temperature in bathroom, in Celsius

RH_5, Humidity in bathroom, in %

T6, Temperature outside the building (north side), in Celsius

RH_6, Humidity outside the building (north side), in %

T7, Temperature in ironing room, in Celsius

RH_7, Humidity in ironing room, in %

T8, Temperature in teenager room 2, in Celsius

RH_8, Humidity in teenager room 2, in %

T9, Temperature in parents room, in Celsius

RH_9, Humidity in parents room, in %

T_out, Temperature outside (from Chievres weather station), in Celsius

Pressure (from Chievres weather station), in mm Hg

RH_out, Humidity outside (from Chievres weather station), in %

Wind speed (from Chievres weather station), in m/s

Visibility (from Chievres weather station), in km

Tdewpoint (from Chievres weather station), in °C

rv1, Random variable 1, nondimensional

rv2, Random variable 2, nondimensional

In [44]:
#import libraries for data manipulation, data wrangling, modelling and model evaluation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [68]:
#loading the dataset
df = pd.read_csv('/content/drive/MyDrive/energydata_complete.csv')
df.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [69]:
#Brief Data description
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

In [47]:
#summary of the data
df.describe()

Unnamed: 0,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
count,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,...,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0
mean,97.694958,3.801875,21.686571,40.259739,20.341219,40.42042,22.267611,39.2425,20.855335,39.026904,...,19.485828,41.552401,7.411665,755.522602,79.750418,4.039752,38.330834,3.760707,24.988033,24.988033
std,102.524891,7.935988,1.606066,3.979299,2.192974,4.069813,2.006111,3.254576,2.042884,4.341321,...,2.014712,4.151497,5.317409,7.399441,14.901088,2.451221,11.794719,4.194648,14.496634,14.496634
min,10.0,0.0,16.79,27.023333,16.1,20.463333,17.2,28.766667,15.1,27.66,...,14.89,29.166667,-5.0,729.3,24.0,0.0,1.0,-6.6,0.005322,0.005322
25%,50.0,0.0,20.76,37.333333,18.79,37.9,20.79,36.9,19.53,35.53,...,18.0,38.5,3.666667,750.933333,70.333333,2.0,29.0,0.9,12.497889,12.497889
50%,60.0,0.0,21.6,39.656667,20.0,40.5,22.1,38.53,20.666667,38.4,...,19.39,40.9,6.916667,756.1,83.666667,3.666667,40.0,3.433333,24.897653,24.897653
75%,100.0,0.0,22.6,43.066667,21.5,43.26,23.29,41.76,22.1,42.156667,...,20.6,44.338095,10.408333,760.933333,91.666667,5.5,40.0,6.566667,37.583769,37.583769
max,1080.0,70.0,26.26,63.36,29.856667,56.026667,29.236,50.163333,26.2,51.09,...,24.5,53.326667,26.1,772.3,100.0,14.0,66.0,15.5,49.99653,49.99653


In [48]:
# Check the dataset for missing values
df.isna().sum()

date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

Question 1

The percent of the total variation of the dependent variable Y explained by the set of independent variables X is measured by

Answer: Coefficient of Determination

Question 2:

How do you define a Residual?

Answer: Y - Ŷ (the observed value minus the predicted value)
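
A minimal sketch of this definition, using a small made-up data set and a straight line fitted with `np.polyfit`. The numbers are illustrative only; the point is that each residual is the observed value minus the fitted value, and that for a least-squares line with an intercept the residuals sum to roughly zero.

```python
import numpy as np

# Made-up observations for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares straight line: y_hat = intercept + slope * x.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# Residual = observed value minus predicted value.
residuals = y - y_hat
print(residuals)
print(residuals.sum())  # close to zero, up to floating-point error
```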
 
Question 3:

In the straight-line graph of the equation Y = a + bX, the line is horizontal if

Answer: b = 0

Question 4:

Which one of the following is true about heteroskedasticity?

Answer: Linear Regression with varying error terms
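
To make "varying error terms" concrete, here is a small simulation, offered as a sketch under stated assumptions: the noise standard deviation is chosen to grow in proportion to `x`, which is only one of many ways heteroskedasticity can arise. After fitting a straight line, the residuals are visibly more spread out at large `x` than at small `x`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the noise standard deviation grows with x (assumption for this demo).
x = np.linspace(1, 10, 2000)
noise = rng.normal(0, 1, size=x.size) * x
y = 2 + 3 * x + noise

# Fit a straight line and compute the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)

# Compare the residual spread in the low-x and high-x regions.
low = resid[x < 4].std()
high = resid[x > 7].std()
print(f"residual std for x < 4: {low:.2f}")
print(f"residual std for x > 7: {high:.2f}")
```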

Question 5:

Generally, which of the following method(s) is used for predicting continuous dependent variables?

1. Linear Regression

2. Logistic Regression

Answer: 1 only

Question 6:

Which of the following is/are true about the “Ridge” and “Lasso” regression methods with respect to feature selection?

Answer: Lasso regression uses subset selection of features

Question 7:

Which of the following sentences is/are true about outliers in Linear Regression:

Answer: Linear regression is sensitive to outliers
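
The sensitivity can be seen on a toy example. The sketch below fits a line to ten points that lie exactly on y = 2x + 1, then shifts a single point upward by 50 and refits. The data are invented for illustration; the size of the effect on a real data set will differ.

```python
import numpy as np

# Ten points lying exactly on the line y = 2x + 1.
x = np.arange(10.0)
y = 2 * x + 1

slope_clean, _ = np.polyfit(x, y, deg=1)

# Turn the last point into an outlier and refit.
y_outlier = y.copy()
y_outlier[-1] += 50
slope_outlier, _ = np.polyfit(x, y_outlier, deg=1)

print(f"slope without outlier: {slope_clean:.3f}")
print(f"slope with one outlier: {slope_outlier:.3f}")
```

Because least squares penalises squared errors, one extreme point at the edge of the x-range pulls the fitted slope far from 2.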

Question 8

Which of the following metrics can be used for evaluating regression models?

1. R Squared

2. Adjusted R Squared

3. F Statistics

4. RMSE / MSE / MAE

Answer: 1, 2, 3 and 4
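
Adjusted R-squared and the overall F-statistic can both be derived from R-squared, the number of observations `n` and the number of predictors `p`. The helper names and the input values below are hypothetical and chosen only to show the arithmetic with the standard textbook formulas.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F-statistic for a linear model with an intercept and p predictors."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


# Hypothetical values, for illustration only.
r2, n, p = 0.75, 100, 4
adj = adjusted_r2(r2, n, p)
f_stat = f_statistic(r2, n, p)
print(round(adj, 4), round(f_stat, 2))
```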

Question 9

A best-fit line relating X and Y has an R-squared value of 0.75. How should this be interpreted?

Answer: 75% of the variance in Y is explained by X
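
The interpretation follows from the definition R² = 1 - RSS/TSS, where RSS is the residual sum of squares and TSS is the total sum of squares around the mean of Y. The sketch below uses made-up observations and hypothetical predictions, picked so that the ratio works out to 0.75.

```python
import numpy as np

# Made-up observations and hypothetical model predictions.
y = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([1.0, 6.0, 6.0, 8.0])

rss = np.sum((y - y_hat) ** 2)        # unexplained variation: 1 + 4 + 0 + 0 = 5
tss = np.sum((y - y.mean()) ** 2)     # total variation: 9 + 1 + 1 + 9 = 20
r2 = 1 - rss / tss                    # share of the variation that is explained

print(rss, tss, r2)
```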

Question 10

Which of the following measures is optimal for comparing the goodness of the fit of competing regression models involving the same dependent variable?

Answer: Standard deviation of the residuals



Question 11

The Lasso can be interpreted as least-squares linear regression where:

Answer: Weights are regularized with the L1 norm
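
For reference, the objective minimised by scikit-learn's `Lasso` estimator can be written as below, with the `Ridge` objective shown for contrast. The symbols are notation chosen here: $X$ is the feature matrix, $y$ the target, $w$ the weight vector, $n$ the number of samples and $\alpha$ the regularisation strength.

```latex
% Lasso: least squares plus an L1 penalty on the weights
\hat{w}_{\text{lasso}} = \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1

% Ridge: least squares plus a squared L2 penalty on the weights
\hat{w}_{\text{ridge}} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
```

The L1 penalty is what allows some weights to become exactly zero, whereas the squared L2 penalty only shrinks them.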

Question 12

From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the R^2 value to two decimal places?

Answer: 0.64

In [49]:
# Note: x_train and x_test are the normalised train/test splits created in the cells further below
# Training set: T2 as the predictor and T6 as the response
train_x = x_train[['T2']]
train_y = x_train[['T6']]

# Test set: the same two columns
test_x = x_test[['T2']]
test_y = x_test[['T6']]

# Instantiate the LinearRegression model
model = LinearRegression()

# Fit the model with training set
model.fit(train_x, train_y)

# Make predictions on test set
prediction = model.predict(test_x)

#Compute the r2_score on test set
R2_score = round(r2_score(test_y, prediction), 2)
R2_score

0.64
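
The cell above fits the line on min-max normalised columns rather than on the raw Celsius values. For a straight-line fit with an intercept, the test-set R² should not depend on that kind of rescaling, because min-max scaling is a linear transformation of each column. The sketch below checks this on synthetic stand-ins for T2 and T6; the data, the 140/60 split and the helper names are assumptions made for the demonstration.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - rss / tss


def min_max(v):
    """Scale a 1-D array to the range [0, 1]."""
    return (v - v.min()) / (v.max() - v.min())


def fit_and_score(x, y, n_train=140):
    """Fit a line on the first n_train points and return R^2 on the rest."""
    slope, intercept = np.polyfit(x[:n_train], y[:n_train], deg=1)
    return r2(y[n_train:], intercept + slope * x[n_train:])


rng = np.random.default_rng(42)
x = rng.uniform(16, 30, size=200)                  # synthetic stand-in for T2
y = 1.5 * x - 25 + rng.normal(0, 3, size=200)      # synthetic stand-in for T6

r2_raw = fit_and_score(x, y)
r2_scaled = fit_and_score(min_max(x), min_max(y))
print(r2_raw, r2_scaled)
```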

Normalize the dataset using the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set. 

Then answer questions 13 - 20.

In [70]:
df=df.drop(columns=['date', 'lights'])
df.head()

Unnamed: 0,Appliances,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,60,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,60,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,50,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,50,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,60,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [71]:
#normalise the dataset using the MinMaxScaler
scaler = MinMaxScaler()
#normalise the features
normalised_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
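
As a quick check of what this scaling does, the sketch below applies the min-max formula (x - min) / (max - min) column by column with NumPy and compares the result with `MinMaxScaler` at its default `feature_range=(0, 1)`. The small matrix is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up matrix: two columns on very different scales.
X = np.array([[10.0, 700.0],
              [20.0, 750.0],
              [40.0, 800.0]])

# Manual min-max scaling, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, every column has a minimum of 0 and a maximum of 1, so the target and the metrics computed later are expressed in normalised units rather than Wh.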

In [72]:
# Split the data into features (predictors) and target (response) variables

#predictors
features_df = normalised_df.drop(['Appliances'], axis=1)

#target
target_var = normalised_df['Appliances']

In [73]:
# split the data set into training and testing set
x_train, x_test, y_train, y_test = train_test_split(features_df, target_var, test_size=0.3, random_state=42)
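
The notebook follows the quiz instructions and scales the full dataset before splitting. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that no information from the test rows enters the scaling. This is not what the cells above do, and applying it would likely change the quiz answers slightly; the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))   # synthetic features, for illustration only
y = rng.normal(size=100)        # synthetic target

# Split first, using the same proportions and seed as the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

With this approach the scaled test values can fall slightly outside [0, 1], which is expected.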

In [74]:
# create a dictionary of different algorithms 
models = {
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'LinearRegression': LinearRegression()
}

In [55]:
# Helper function to compute the RSS, r2_score, MAE, MSE and RMSE on the test set

def compute_score(models, x_train, x_test, y_train, y_test):
    """
    Fit every model in the dictionary and evaluate it on the test set.
    Parameters: models (dict mapping a name to a regressor), training set (x_train),
    test set (x_test), training labels (y_train) and test labels (y_test).
    Returns: a DataFrame with one row of metrics (RSS, r2_score, MAE, MSE, RMSE) per model.
    """

    # store the metrics of each model
    model_properties = []

    # loop through the dictionary of models
    for reg_name, regressor in models.items():
        # fit the regressor on the training set
        regressor.fit(x_train, y_train)
        # predict once on the test set and reuse the predictions for every metric
        y_pred = regressor.predict(x_test)
        mse = mean_squared_error(y_test, y_pred)
        # collect the metrics for this model
        model_properties.append({
            'Name': reg_name,
            'RSS': round(np.sum(np.square(y_test - y_pred)), 2),
            'r2_score': round(r2_score(y_test, y_pred), 2),
            'MAE': round(mean_absolute_error(y_test, y_pred), 2),
            'MSE': round(mse, 3),
            'RMSE': round(np.sqrt(mse), 3),
        })

    # create a dataframe from the list of per-model metrics
    summary_df = pd.DataFrame(model_properties)

    return summary_df

In [56]:
# execute the function
compute_score(models, x_train, x_test, y_train, y_test)

Unnamed: 0,Name,RSS,r2_score,MAE,MSE,RMSE
0,Ridge,45.42,0.15,0.05,0.008,0.088
1,Lasso,53.28,-0.0,0.06,0.009,0.095
2,LinearRegression,45.35,0.15,0.05,0.008,0.088


Question 13

Normalize the dataset using the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set. Answer the following questions:

What is the Mean Absolute Error (in two decimal places)?
Answer: 0.05

In [57]:
# Let's instantiate a LinearRegression model to answer questions 13 - 17

# Instantiate the LinearRegression Model
lin_reg = LinearRegression()

# fit the model with training set
lin_reg.fit(x_train, y_train)

# Make predictions on testing set
y_pred = lin_reg.predict(x_test)
# Computing the Mean_absolute_error
MAE = round(mean_absolute_error(y_test, y_pred), 2)

MAE

0.05

Question 14

What is the Residual Sum of Squares (in two decimal places)?
Answer: 45.35

In [58]:
# Computing the Residual Sum of Squares to two decimal places
RSS = round(np.sum(np.square(y_test - y_pred)), 2)

RSS

45.35

Question 15

What is the Root Mean Squared Error (in three decimal places)?

Answer: 0.088

In [59]:
RMSE = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)

RMSE

0.088
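
The RSS, MSE and RMSE reported above are tied together by MSE = RSS / n and RMSE = sqrt(MSE), where n is the number of test rows. The arithmetic below assumes the test set holds 30% of the 19,735 rows, rounded up; that row count is an assumption about how the split was sized, not a figure printed in the notebook.

```python
import math

rss = 45.35                        # RSS reported for the linear model
n_test = math.ceil(0.3 * 19735)    # assumed test-set size: 30% of 19,735 rows, rounded up

mse = rss / n_test
rmse = math.sqrt(mse)

print(n_test, round(mse, 3), round(rmse, 3))
```

Under that assumption the rounded values agree with the MSE of 0.008 and the RMSE of 0.088 shown in the summary table.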

Question 16

What is the Coefficient of Determination (in two decimal places)?

Answer: 0.15

In [60]:
# Computing the r2_score
R2_score = r2_score(y_test, y_pred)
round(R2_score, 2)

0.15

Question 17

Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?

Answer: The features with the lowest and highest weights are RH_2 and RH_1, respectively

In [61]:
def get_weights_df(model, feat, col_name):
  # this function returns the weight of every feature, sorted in ascending order
  weights = pd.Series(model.coef_, feat.columns).sort_values()
  weights_df = pd.DataFrame(weights).reset_index()
  weights_df.columns = ['Features', col_name]
  # weights are returned unrounded
  return weights_df

# Execute the get_weights_df function and store the result in a DataFrame
linear_weights_df = get_weights_df(lin_reg, x_train, 'Linear_weight')

linear_weights_df

Unnamed: 0,Features,Linear_weight
0,RH_2,-0.456698
1,T_out,-0.32186
2,T2,-0.236178
3,T9,-0.189941
4,RH_8,-0.157595
5,RH_out,-0.077671
6,RH_7,-0.044614
7,RH_9,-0.0398
8,T5,-0.015657
9,T1,-0.003281


Question 18

Train a ridge regression model with an alpha value of 0.4. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?


Answer: No

In [75]:
# Instantiate a ridge model with the default alpha value (alpha=1.0)
ridge = Ridge()

# Fit the ridge model on the training set
ridge.fit(x_train, y_train)

# Make predictions on the test set
pred = ridge.predict(x_test)

ridge_rmse = round(np.sqrt(mean_squared_error(y_test, pred)), 3)

print(f'The RMSE score for the ridge model is: {ridge_rmse}')


# Instantiate another ridge model with the alpha value set to 0.4
ridge_reg = Ridge(alpha=0.4)

# Fit on the training set
ridge_reg.fit(x_train, y_train)

# Make new predictions on the test set
new_pred = ridge_reg.predict(x_test)

ridge_rmse_new = round(np.sqrt(mean_squared_error(y_test, new_pred)), 3)
print(f'The RMSE score for the ridge model is: {ridge_rmse_new}')

print('There is no change')

The RMSE score for the ridge model is: 0.088
The RMSE score for the ridge model is: 0.088
There is no change


Question 19

Train a lasso regression model with an alpha value of 0.001 and obtain the new feature weights with it. How many of the features have non-zero feature weights?
Answer: 4

In [63]:
# Instantiate a lasso regressor with an alpha value of 0.001
lasso_reg = Lasso(alpha=0.001)

# Fit the lasso model on the training set
lasso_reg.fit(x_train, y_train)

# Make predictions on the test set
lasso_preds = lasso_reg.predict(x_test)

# Define a function to get the feature weights
def get_weights_df(model, feat, col_name):
  # this function returns the weight of every feature, sorted in ascending order
  weights = pd.Series(model.coef_, feat.columns).sort_values()
  weights_df = pd.DataFrame(weights).reset_index()
  weights_df.columns = ['Features', col_name]
  # weights are left unrounded so that small non-zero weights are not turned into zeros
  return weights_df

lasso_weights_df = get_weights_df(lasso_reg, x_train, 'Lasso_weight')

# Obtain the DataFrame with the non-zero feature weights
non_zero_weights = lasso_weights_df[lasso_weights_df['Lasso_weight'] != 0]

# view non_zero dataframe
print(non_zero_weights)

print('')

# print the total number of non_zero weights
print(f'The number of features with non-zero feature weights are equal to: {len(non_zero_weights)}')

     Features  Lasso_weight
0      RH_out     -0.049557
1        RH_8     -0.000110
24  Windspeed      0.002912
25       RH_1      0.017880

The number of features with non-zero feature weights are equal to: 4
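
The count of non-zero weights depends on whether the weights are rounded before they are compared with zero. The sketch below takes the four Lasso weights printed above and counts the non-zero entries with and without rounding to three decimal places; the array is copied by hand from that output.

```python
import numpy as np

# The four non-zero Lasso weights printed above (RH_out, RH_8, Windspeed, RH_1).
weights = np.array([-0.049557, -0.000110, 0.002912, 0.017880])

# Count on the unrounded weights, as the notebook does.
exact_count = np.count_nonzero(weights)

# Count after rounding to three decimal places: the RH_8 weight becomes -0.0.
rounded_count = np.count_nonzero(np.round(weights, 3))

print(exact_count, rounded_count)
```

Filtering on the unrounded weights gives the answer of 4; rounding first would hide the very small RH_8 weight.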


Question 20

What is the new RMSE with the lasso regression? (Answer should be in three (3) decimal places)

Answer: 0.094

In [64]:
# Compute the new RMSE with lasso regression
RMSE_new = round(np.sqrt(mean_squared_error(y_test, lasso_preds)), 3)

RMSE_new

0.094