__Problem Definition:__ Given the Appliances Energy Consumption dataset, we develop a multiple linear regression model that predicts the energy consumption of appliances on unseen test data. To answer the quiz questions, the dataset is normalised with the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. We use a 70-30 train-test split with a random state of 42 (for reproducibility).

In [14]:
# import the libraries required for data manipulation and data wrangling

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

#import required libraries for modelling and model evaluation
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [15]:
#loading the dataset
energy_df = pd.read_csv('./energydata_complete.csv')
energy_df.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [16]:
#descriptive summary of the data
energy_df.describe()

Unnamed: 0,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
count,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,...,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0
mean,97.694958,3.801875,21.686571,40.259739,20.341219,40.42042,22.267611,39.2425,20.855335,39.026904,...,19.485828,41.552401,7.411665,755.522602,79.750418,4.039752,38.330834,3.760707,24.988033,24.988033
std,102.524891,7.935988,1.606066,3.979299,2.192974,4.069813,2.006111,3.254576,2.042884,4.341321,...,2.014712,4.151497,5.317409,7.399441,14.901088,2.451221,11.794719,4.194648,14.496634,14.496634
min,10.0,0.0,16.79,27.023333,16.1,20.463333,17.2,28.766667,15.1,27.66,...,14.89,29.166667,-5.0,729.3,24.0,0.0,1.0,-6.6,0.005322,0.005322
25%,50.0,0.0,20.76,37.333333,18.79,37.9,20.79,36.9,19.53,35.53,...,18.0,38.5,3.666667,750.933333,70.333333,2.0,29.0,0.9,12.497889,12.497889
50%,60.0,0.0,21.6,39.656667,20.0,40.5,22.1,38.53,20.666667,38.4,...,19.39,40.9,6.916667,756.1,83.666667,3.666667,40.0,3.433333,24.897653,24.897653
75%,100.0,0.0,22.6,43.066667,21.5,43.26,23.29,41.76,22.1,42.156667,...,20.6,44.338095,10.408333,760.933333,91.666667,5.5,40.0,6.566667,37.583769,37.583769
max,1080.0,70.0,26.26,63.36,29.856667,56.026667,29.236,50.163333,26.2,51.09,...,24.5,53.326667,26.1,772.3,100.0,14.0,66.0,15.5,49.99653,49.99653


In [17]:
# remove the lights column since most of its entries are zero, as seen from the descriptive statistics above
# also remove the date column since this is not treated as a time-series problem
energy_df = energy_df.drop(columns=['date', 'lights'])
energy_df.head()

Unnamed: 0,Appliances,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,60,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,60,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,50,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,50,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,60,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [18]:
#normalise the dataset using the MinMaxScaler

#instantiate the scaler
scaler = MinMaxScaler()

# normalise every column (the features and the target 'Appliances')
normalised_df = pd.DataFrame(scaler.fit_transform(energy_df), columns=energy_df.columns)
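
For reference, `MinMaxScaler` with its default settings rescales each column to the range [0, 1] using (x - min) / (max - min). The following is a minimal sketch on a small made-up array (the values and the names `toy`, `manual` and `scaled` are illustrative assumptions, not part of the dataset or the notebook):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, not read from energydata_complete.csv)
toy = np.array([[10.0, 733.5],
                [60.0, 755.5],
                [110.0, 772.3]])

# Manual min-max scaling: (x - column min) / (column max - column min)
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

# The same transformation using scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
```

Because the target column is scaled together with the features here, the error metrics reported later are in normalised units rather than in the original units of `Appliances`.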

In [19]:
# splitting the data into features(predictors) and target(response) variables

#predictors
features_df = normalised_df.drop(['Appliances'], axis=1)

#target
target_var = normalised_df['Appliances']

In [25]:
# split the data set into training and testing set
x_train, x_test, y_train, y_test = train_test_split(features_df, target_var, test_size=0.3, random_state=42)
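
The cells above scale the full dataset before splitting, as the problem definition asks. A common alternative, shown here only as a sketch on synthetic data and not as the method used in this notebook, is to fit the scaler on the training split alone so that no statistics from the test rows influence the preprocessing. All names in the block (`X`, `y`, `scaler_tr`, and so on) are assumptions introduced for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))   # synthetic features
y = rng.normal(size=200)        # synthetic target

# Split first, using the same proportions and seed as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler_tr = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler_tr.transform(X_tr)
X_te_scaled = scaler_tr.transform(X_te)   # test values may fall slightly outside [0, 1]

print(X_tr_scaled.shape, X_te_scaled.shape)
```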

In [26]:
# create a dictionary of different algorithms 
models = {
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'LinearRegression': LinearRegression()
}

In [27]:
# helper function to compute the RSS, r2_score, MAE, MSE and RMSE on the testing set

def compute_score(models, x_train, x_test, y_train, y_test):
    """
    Fit every model in the dictionary on the training set and compute the
    RSS, r2_score, MAE, MSE and RMSE on the test set.
    Parameters: models (dict mapping a name to an estimator), training set (x_train),
    test set (x_test), training labels (y_train) and test labels (y_test).
    Returns: a DataFrame with one row of metrics per model.
    """

    # store properties of each model
    model_properties = []

    # loop through the dictionary of models
    for reg_name, regressor in models.items():
        # fit the regressor and predict once on the test set
        regressor.fit(x_train, y_train)
        y_pred = regressor.predict(x_test)
        mse = mean_squared_error(y_test, y_pred)

        # collect the metrics of this regressor
        model_properties.append({
            'Name': reg_name,
            'RSS': round(np.sum(np.square(y_test - y_pred)), 2),
            'r2_score': round(r2_score(y_test, y_pred), 2),
            'MAE': round(mean_absolute_error(y_test, y_pred), 2),
            'MSE': round(mse, 3),
            'RMSE': round(np.sqrt(mse), 3),
        })

    # create a dataframe from the list of all the model properties
    return pd.DataFrame(model_properties)

In [28]:
# execute the function
compute_score(models, x_train, x_test, y_train, y_test)

| | Name | RSS | r2_score | MAE | MSE | RMSE |
|---|---|---|---|---|---|---|
| 0 | Ridge | 45.42 | 0.15 | 0.05 | 0.008 | 0.088 |
| 1 | Lasso | 53.28 | -0.0 | 0.06 | 0.009 | 0.095 |
| 2 | LinearRegression | 45.35 | 0.15 | 0.05 | 0.008 | 0.088 |
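
The metrics in this table are closely related: MSE is the RSS divided by the number of test rows, RMSE is the square root of MSE, and R² is one minus RSS over the total sum of squares. The sketch below checks these identities on a tiny made-up example (the arrays `y_true` and `y_hat` are hypothetical and unrelated to the dataset):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.10, 0.05, 0.20, 0.15])   # made-up targets
y_hat = np.array([0.12, 0.07, 0.15, 0.10])    # made-up predictions

residuals = y_true - y_hat
rss = np.sum(residuals ** 2)                   # residual sum of squares
mse = rss / len(y_true)                        # MSE = RSS / n
rmse = np.sqrt(mse)                            # RMSE = sqrt(MSE)
mae = np.mean(np.abs(residuals))               # mean absolute error
tss = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - rss / tss                             # R^2 = 1 - RSS / TSS

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

As a rough sanity check on the table, the 30% test split of 19,735 rows holds about 5,921 rows, and 45.35 / 5921 ≈ 0.0077, which rounds to the reported MSE of 0.008. Since the target was min-max scaled together with the features, all of these errors are in normalised units.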


# Quiz Questions

__Question 1:__ In linear regression, L2 regularization is equivalent to imposing a:
__Answer:__ Gaussian prior.
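
A brief sketch of why, assuming Gaussian noise with variance σ² and an independent zero-mean Gaussian prior with variance τ² on each weight (these symbols are introduced here for illustration): the maximum a posteriori estimate is the ridge solution.

```latex
\begin{aligned}
\hat{w}_{\text{MAP}}
&= \arg\max_{w} \; \log p(y \mid X, w) + \log p(w) \\
&= \arg\max_{w} \; -\frac{1}{2\sigma^{2}} \lVert y - Xw \rVert_2^2 - \frac{1}{2\tau^{2}} \lVert w \rVert_2^2 \\
&= \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2,
\qquad \lambda = \frac{\sigma^{2}}{\tau^{2}}.
\end{aligned}
```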

__Question 2:__ Cross validation:
__Answer:__ Is often used to select hyperparameters. (Cross validation reduces the risk of overfitting but is not guaranteed to prevent it.)

__Question 3:__ Ridge Regression:
__Answer:__ Reduces variance at the expense of higher bias.
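
One way to see the shrinkage behind this trade-off is to watch the size of the ridge coefficients as `alpha` grows. The sketch below uses synthetic data (the data-generating weights and the chosen `alphas` are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                                   # synthetic features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=50)

# Larger alpha means a stronger L2 penalty and therefore smaller coefficients
alphas = [0.01, 1.0, 100.0]
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas]

for a, n in zip(alphas, norms):
    print(f'alpha={a}: ||w||_2 = {n:.3f}')
```

The coefficient norm shrinks as the penalty grows, which makes the fit less sensitive to the particular training sample (lower variance) while pulling it away from the unpenalised least squares solution (higher bias).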

__Question 4:__ Among the terms of the bias-variance tradeoff, which of the following is substantially more harmful to the test error than to the training error?
__Answer:__ Variance.

__Question 5:__ What can you use to find the best fit line for Linear Regression?
__Answer:__ Least Square Error. (Linear regression uses the least squares method as its cost function and aims to minimise it, i.e. to reduce the squared vertical distances between the actual points and the line of best fit.)
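
As a small illustration of the least squares idea, the sketch below solves for the slope and intercept directly with `np.linalg.lstsq` and compares the result with `LinearRegression`. The data are synthetic and the names `m` and `c` simply mirror the usual y = mx + c notation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)                       # synthetic predictor
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=100)    # noisy line y = 2x + 1

# Design matrix with a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Least squares solution: minimises ||A @ [m, c] - y||^2
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)

# scikit-learn's LinearRegression gives the same two coefficients
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(m, model.coef_[0])
print(c, model.intercept_)
```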

__Question 6:__ Which of the following is true about outliers in linear regression?
__Answer:__ Linear regression is sensitive to outliers. (In most cases an outlier changes the slope of the regression line, so linear regression is definitely sensitive to outliers.)
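
A minimal sketch of this sensitivity, using made-up points that lie exactly on a line and then corrupting a single one of them (all values are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0            # points exactly on y = 2x + 1

# Slope fitted on the clean data
slope_clean = LinearRegression().fit(x, y).coef_[0]

# Corrupt a single point and refit
y_outlier = y.copy()
y_outlier[-1] += 50.0
slope_outlier = LinearRegression().fit(x, y_outlier).coef_[0]

print(slope_clean, slope_outlier)
```

Because the squared error penalises large residuals heavily, one extreme point is enough to pull the fitted slope well away from its original value.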

__Question 7:__ How many coefficients do you need to estimate in a simple linear regression model (one independent variable)?
__Answer:__ 2. In y = mx + c, the slope m and the intercept c are the regression coefficients.

__Question 8:__ Adding more basis functions in a linear model:
__Answer:__ Decreases model bias.

__Question 9:__ A best fit line relating X and Y has a R-squared value of 0.75. How do I interpret this information?
__Answer:__ 75% of the variance in Y is explained by X.

__Question 10:__ The Lasso can be interpreted as least-squares regression where:
__Answer:__ Weights are regularized with the L1 norm.
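
In scikit-learn, `Lasso` minimises (1 / (2 * n_samples)) * ||y - Xw||² + alpha * ||w||₁, so a larger `alpha` drives more weights to exactly zero. The sketch below illustrates this on synthetic data scaled to [0, 1] (the data and variable names are assumptions for illustration). With features and target in [0, 1], the default `alpha=1.0` is large enough to zero every weight, which would be consistent with the near-zero R² of the default Lasso in the summary table:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.uniform(size=(200, 5))                             # features already in [0, 1]
y = X @ np.array([1.0, 0.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.05, size=200)
y = (y - y.min()) / (y.max() - y.min())                    # min-max scale the target

# Strong penalty: every weight is shrunk to exactly zero
n_nonzero_strong = np.count_nonzero(Lasso(alpha=1.0).fit(X, y).coef_)

# Weak penalty: informative features keep non-zero weights
n_nonzero_weak = np.count_nonzero(Lasso(alpha=0.001).fit(X, y).coef_)

print(n_nonzero_strong, n_nonzero_weak)
```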

__Question 11:__ Which of these is an assumption of Linear Regression?
__Answer:__ Multivariate normality.

__Question 12:__ From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the R^2 value to two decimal places?
__Answer:__ 0.64

In [30]:
# Obtaining the Training set of the T2 and T6 columns from the normalised data
train_x = x_train[['T2']]
train_y = x_train[['T6']]

# Obtaining the Testing set of the T2 and T6 column from the normalised data  
test_x = x_test[['T2']]
test_y = x_test[['T6']]

# Instantiate the LinearRegression model
model = LinearRegression()

# Fit the model with training set
model.fit(train_x, train_y)

# Make predictions on test set
prediction = model.predict(test_x)

#Compute the r2_score on test set
R2_score = round(r2_score(test_y, prediction), 2)
R2_score

0.64
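
For a simple linear regression with an intercept, the R² computed on the data used for fitting equals the squared Pearson correlation between x and y. The sketch below checks this on synthetic data (the arrays are made up and do not correspond to T2 and T6). The cell above scores on the held-out test split, so its value need not match the squared correlation exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
x = rng.normal(size=200)                          # synthetic predictor
y = 0.8 * x + rng.normal(scale=0.5, size=200)     # synthetic response

# Fit and score on the same data
model = LinearRegression().fit(x.reshape(-1, 1), y)
r2_fit = r2_score(y, model.predict(x.reshape(-1, 1)))

# Pearson correlation between x and y
pearson_r = np.corrcoef(x, y)[0, 1]

print(r2_fit, pearson_r ** 2)
```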

__Question 13:__ What is the Mean Absolute Error (to two decimal places)?

__Answer:__ 0.05

In [31]:
# Let's instantiate the general LinearRegression model to answer questions 13 - 17

# Instantiate the LinearRegression Model
lin_reg = LinearRegression()

# fit the model with training set
lin_reg.fit(x_train, y_train)

# Make predictions on testing set
y_pred = lin_reg.predict(x_test)

In [32]:
# Computing the Mean_absolute_error
MAE = round(mean_absolute_error(y_test, y_pred), 2)

MAE

0.05

__Question 14:__ What is the Residual Sum of Squares (to two decimal places)?

__Answer:__ 45.35

In [34]:
# Computing the Residual Sum of Squares to two decimal places
RSS = round(np.sum(np.square(y_test - y_pred)), 2)

RSS

45.35

__Question 15:__ What is the Root Mean Squared Error (to three decimal places)?

__Answer:__ 0.088

In [35]:
RMSE = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)

RMSE

0.088

__Question 16:__ What is the Coefficient of Determination (to two decimal places)?

__Answer:__ 0.15

In [36]:
# Computing the r2_score
R2_score = r2_score(y_test, y_pred)
round(R2_score, 2)

0.15

__Question 17:__ Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?

__Answer:__ From the sorted weights computed below, the features with the lowest and highest weights are RH_2 and RH_1, respectively.

In [37]:
def get_weights_df(model, feat, col_name):
    """Return a DataFrame of feature names and fitted weights, sorted in ascending order."""
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    weights_df = pd.DataFrame(weights).reset_index()
    weights_df.columns = ['Features', col_name]
    # weights are deliberately left unrounded so that very small
    # non-zero weights are not rounded to exactly zero
    return weights_df

# Execute the get_weights_df function and store the result in a dataframe
linear_weights_df = get_weights_df(lin_reg, x_train, 'Linear_weight')

linear_weights_df

| | Features | Linear_weight |
|---|---|---|
| 0 | RH_2 | -0.456698 |
| 1 | T_out | -0.32186 |
| 2 | T2 | -0.236178 |
| 3 | T9 | -0.189941 |
| 4 | RH_8 | -0.157595 |
| 5 | RH_out | -0.077671 |
| 6 | RH_7 | -0.044614 |
| 7 | RH_9 | -0.0398 |
| 8 | T5 | -0.015657 |
| 9 | T1 | -0.003281 |
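
The table as shown lists only the ten most negative weights, so the feature with the highest weight does not appear in it. Rather than reading the extremes off a sorted table, they can be looked up directly with `idxmin` and `idxmax`. The sketch below uses hypothetical feature names and weights, not the fitted values:

```python
import pandas as pd

# Hypothetical weights for illustration only (not the fitted coefficients)
weights = pd.Series({'feat_a': -0.46, 'feat_b': 0.02, 'feat_c': 0.55, 'feat_d': -0.10})

# Labels of the smallest and largest weights
lowest, highest = weights.idxmin(), weights.idxmax()

print(f'lowest: {lowest}, highest: {highest}')
```

In the notebook, the same calls could be applied to `pd.Series(lin_reg.coef_, x_train.columns)`.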


__Question 18:__ Train a Ridge Regression model with an alpha value of 0.4. Is there any change to the root mean squared error(RMSE) when evaluated on the test set?

__Answer:__ From the calculations below, the RMSE is unchanged at three decimal places (0.088), hence the answer is No.

In [38]:
# Instantiate a ridge model with the default alpha value (alpha=1.0)
ridge = Ridge()

# Fit the ridge model on the training set
ridge.fit(x_train, y_train)

# Make predictions on the test set
pred = ridge.predict(x_test)

ridge_rmse = round(np.sqrt(mean_squared_error(y_test, pred)), 3)

print(f'The RMSE score for the ridge model with the default alpha value when evaluated on the test set is: {ridge_rmse}')


# Instantiate another ridge model with the alpha value set to 0.4
ridge_reg = Ridge(alpha=0.4)

# Fit on the training set
ridge_reg.fit(x_train, y_train)

# Make new predictions on the test set
new_pred = ridge_reg.predict(x_test)

ridge_rmse_new = round(np.sqrt(mean_squared_error(y_test, new_pred)), 3)

print(f'The RMSE score for the ridge model with alpha=0.4 when evaluated on the test set is: {ridge_rmse_new}')

print('There is no change in the RMSE scores')

The RMSE score for the ridge model with the default alpha value when evaluated on the test set is: 0.088
The RMSE score for the ridge model with alpha=0.4 when evaluated on the test set is: 0.088
There is no change in the RMSE scores
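
Rounding to three decimal places can hide small differences between models. The sketch below, on synthetic data with assumed names (`rmse_by_alpha` and the generating weights are hypothetical), prints the RMSE for two `alpha` values both unrounded and rounded so that any difference beyond the third decimal place is visible:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.uniform(size=(300, 4))                                  # synthetic features
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Test-set RMSE for each alpha, kept unrounded
rmse_by_alpha = {}
for alpha in (0.4, 1.0):
    preds = Ridge(alpha=alpha).fit(X_tr, y_tr).predict(X_te)
    rmse_by_alpha[alpha] = np.sqrt(mean_squared_error(y_te, preds))

for alpha, rmse in rmse_by_alpha.items():
    print(f'alpha={alpha}: RMSE={rmse:.6f} (rounded: {round(rmse, 3)})')
```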


__Question 19:__ Train a Lasso regression model with an alpha value of 0.001 and obtain the new feature weights with it. How many of the features have non-zero feature weights?

__Answer:__ 4

In [39]:
# Instantiate a lasso regressor with an alpha value of 0.001
lasso_reg = Lasso(alpha=0.001)

# Fit the lasso model with training set
lasso_reg.fit(x_train, y_train)

# Make predictions on test set
lasso_preds = lasso_reg.predict(x_test)

# reuse the get_weights_df helper defined for Question 17 to get the Lasso weights
lasso_weights_df = get_weights_df(lasso_reg, x_train, 'Lasso_weight')

# Let's obtain the dataframe with non-zero feature weights
non_zero_weights = lasso_weights_df[lasso_weights_df['Lasso_weight'] != 0]

# view non_zero dataframe
print(non_zero_weights)

print('')

# print the total number of non_zero weights
print(f'The total features with non-zero feature weights are equal to: {len(non_zero_weights)}')

     Features  Lasso_weight
0      RH_out     -0.049557
1        RH_8     -0.000110
24  Windspeed      0.002912
25       RH_1      0.017880

The total features with non-zero feature weights are equal to: 4


__Question 20:__ What is the new RMSE with Lasso Regression (to three decimal places)?

__Answer:__ 0.094

In [40]:
# Compute the new RMSE with lasso regression
RMSE_new = round(np.sqrt(mean_squared_error(y_test, lasso_preds)), 3)

RMSE_new

0.094