## Week 10 Assignment

<b>Energy Efficiency</b>

Description: Multi-Linear and Polynomial Regression on the Energy Efficiency Dataset

In this assignment, you will perform multi-linear and polynomial regression on the Energy Efficiency dataset to predict the heating load (y1) of buildings. Follow the instructions below:

1. Load the Energy Efficiency dataset using the pandas library.
    
    ● Dataset Name: Energy Dataset

2. Apply necessary preprocessing steps on the dataset, such as handling missing values, scaling features, or encoding categorical variables if required.

3. Separate the features (X) and the target variable (y: heating load) from the dataset.

4. Split the dataset into training and testing sets using an 80:20 ratio.

5. Perform multi-linear regression:

    ● Fit a multi-linear regression model to the training data using the LinearRegression class from the sklearn.linear_model module.
    
    ● Predict the heating load for the testing data using the trained model.
                              
    ● Evaluate the performance of the model by calculating metrics such as mean squared error (MSE) and coefficient of determination (R^2).
    
    ● Print the MSE and R^2 values to assess the model's accuracy.

6. Perform polynomial regression:

    ● Use the PolynomialFeatures class from the sklearn.preprocessing module to transform the features into polynomial features.

    ● Fit a polynomial regression model to the training data using the LinearRegression class.
    
    ● Predict the heating load for the testing data using the trained polynomial regression model.

    ● Evaluate the performance of the model by calculating MSE and R^2.

    ● Print the MSE and R^2 values.

7. Compare the performance of the multi-linear regression and polynomial regression models based on the MSE and R^2 values.
    
Dataset link: https://docs.google.com/spreadsheets/d/1jXngyixNhyj7C6yj5olExZQeWWwzSUja/edit?usp=sharing&ouid=111885139572109362769&rtpof=true&sd=true

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error

In [2]:
energy_data = pd.read_excel("Energy Dataset.xlsx")
energy_data

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Heating Load
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55
4,0.90,563.5,318.5,122.50,7.0,2,0.0,0,20.84
...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,5,0.4,5,17.88
764,0.62,808.5,367.5,220.50,3.5,2,0.4,5,16.54
765,0.62,808.5,367.5,220.50,3.5,3,0.4,5,16.44
766,0.62,808.5,367.5,220.50,3.5,4,0.4,5,16.48


In [3]:
#Checking for null values
energy_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   X1            768 non-null    float64
 1   X2            768 non-null    float64
 2   X3            768 non-null    float64
 3   X4            768 non-null    float64
 4   X5            768 non-null    float64
 5   X6            768 non-null    int64  
 6   X7            768 non-null    float64
 7   X8            768 non-null    int64  
 8   Heating Load  768 non-null    float64
dtypes: float64(7), int64(2)
memory usage: 54.1 KB


In [4]:
energy_data.isnull().sum() #No null values present

X1              0
X2              0
X3              0
X4              0
X5              0
X6              0
X7              0
X8              0
Heating Load    0
dtype: int64

In [5]:
#Checking for constant columns
var_T = VarianceThreshold()

In [6]:
var_T.fit(energy_data)

In [7]:
var_T.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True])

In [8]:
energy_data.columns[var_T.get_support()] #Columns with non-zero variance

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'Heating Load'], dtype='object')

In [9]:
energy_data.columns[~var_T.get_support()] #No constant columns present

Index([], dtype='object')
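
The check above finds nothing to drop, so it may help to see what the same calls return when a constant column does exist. The following is a minimal sketch on a hypothetical three-column frame (`toy`, `a`, `b` and `c` are made-up names, not part of the Energy dataset):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: column "c" is constant, "a" and "b" vary.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [0.5, 0.1, 0.9],
    "c": [7.0, 7.0, 7.0],
})

# The default threshold of 0.0 flags only zero-variance columns.
selector = VarianceThreshold()
selector.fit(toy)

mask = selector.get_support()        # boolean mask, True = column is kept
kept = list(toy.columns[mask])
dropped = list(toy.columns[~mask])

print("kept:", kept)
print("dropped:", dropped)
```

Inverting the mask with `~` gives the constant columns directly, which is the same pattern used on `energy_data` above.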

In [10]:
#Checking for categorical columns which need to be encoded
energy_data.columns[energy_data.dtypes == "object"]

Index([], dtype='object')

In [11]:
energy_data.columns[energy_data.dtypes != "object"]
#No categorical column present in dataset, all are numerical columns

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'Heating Load'], dtype='object')
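
A dtype check only finds string-typed columns. `X6` and `X8` are stored as `int64` and take only a few distinct values in the preview, so they could be integer-coded categories; whether they should be treated that way is an assumption this notebook does not test. A rough sketch of how such columns could be spotted and one-hot encoded, using a hypothetical toy frame and an arbitrary cardinality threshold:

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical toy frame; "orientation" stands in for an integer-coded category.
toy = pd.DataFrame({
    "area": [514.5, 563.5, 612.5, 661.5, 710.5, 759.5],
    "orientation": [2, 3, 4, 5, 2, 3],
})

# Integer columns with few distinct values are candidates for encoding.
# The cut-off of 5 is an arbitrary choice for this illustration.
low_card = [
    col for col in toy.columns
    if is_integer_dtype(toy[col]) and toy[col].nunique() <= 5
]

# One-hot encode the candidate columns; other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=low_card)

print(low_card)
print(list(encoded.columns))
```

Whether encoding helps here is an empirical question; the rest of this notebook keeps both columns as plain numeric features.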

In [12]:
X = energy_data.drop("Heating Load",axis = 1)
X

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0
4,0.90,563.5,318.5,122.50,7.0,2,0.0,0
...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,5,0.4,5
764,0.62,808.5,367.5,220.50,3.5,2,0.4,5
765,0.62,808.5,367.5,220.50,3.5,3,0.4,5
766,0.62,808.5,367.5,220.50,3.5,4,0.4,5


In [13]:
y = energy_data["Heating Load"]
y

0      15.55
1      15.55
2      15.55
3      15.55
4      20.84
       ...  
763    17.88
764    16.54
765    16.44
766    16.48
767    16.64
Name: Heating Load, Length: 768, dtype: float64

In [14]:
#Scaling the feature columns
scaler = StandardScaler()
data_transformed = scaler.fit_transform(X)
data_transformed

array([[ 2.04177671, -1.78587489, -0.56195149, ..., -1.34164079,
        -1.76044698, -1.81457514],
       [ 2.04177671, -1.78587489, -0.56195149, ..., -0.4472136 ,
        -1.76044698, -1.81457514],
       [ 2.04177671, -1.78587489, -0.56195149, ...,  0.4472136 ,
        -1.76044698, -1.81457514],
       ...,
       [-1.36381225,  1.55394308,  1.12390297, ..., -0.4472136 ,
         1.2440492 ,  1.41133622],
       [-1.36381225,  1.55394308,  1.12390297, ...,  0.4472136 ,
         1.2440492 ,  1.41133622],
       [-1.36381225,  1.55394308,  1.12390297, ...,  1.34164079,
         1.2440492 ,  1.41133622]])

In [15]:
data_transformed.shape

(768, 8)

In [16]:
X = pd.DataFrame(data_transformed,columns=X.columns)
X

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
0,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.341641,-1.760447,-1.814575
1,2.041777,-1.785875,-0.561951,-1.470077,1.0,-0.447214,-1.760447,-1.814575
2,2.041777,-1.785875,-0.561951,-1.470077,1.0,0.447214,-1.760447,-1.814575
3,2.041777,-1.785875,-0.561951,-1.470077,1.0,1.341641,-1.760447,-1.814575
4,1.284979,-1.229239,0.000000,-1.198678,1.0,-1.341641,-1.760447,-1.814575
...,...,...,...,...,...,...,...,...
763,-1.174613,1.275625,0.561951,0.972512,-1.0,1.341641,1.244049,1.411336
764,-1.363812,1.553943,1.123903,0.972512,-1.0,-1.341641,1.244049,1.411336
765,-1.363812,1.553943,1.123903,0.972512,-1.0,-0.447214,1.244049,1.411336
766,-1.363812,1.553943,1.123903,0.972512,-1.0,0.447214,1.244049,1.411336
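
Note that the scaler above is fitted on all 768 rows before the train/test split, so the means and standard deviations it uses include the test rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; `X_demo`, `y_demo` and the fixed seeds are illustrative assumptions, not values from this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and target (not the Energy dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.normal(size=100)

# Split first, so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training columns have mean 0 and unit variance exactly, while the test columns are only approximately standardized, which is the expected behavior.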


### Multilinear Regression

In [17]:
#Splitting the data into training and test sets (80:20)
#No random_state is set, so the split and the scores below change from run to run
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)

In [18]:
xtrain.shape, ytrain.shape

((614, 8), (614,))

In [19]:
xtest.shape, ytest.shape

((154, 8), (154,))

In [20]:
#Training the model
linReg = LinearRegression()
linReg.fit(xtrain, ytrain)

In [21]:
# Predicting the heating load for the test data
ypred = linReg.predict(xtest)

In [22]:
# R^2 score on the test set (closer to 1 is better)
r2_score(y_pred=ypred,y_true=ytest)

0.9284783794052999

In [23]:
mean_squared_error(y_pred=ypred,y_true=ytest)

8.11243748446275
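
For reference, both metrics can be computed by hand: MSE is the mean of the squared residuals, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A small worked sketch on made-up numbers (`y_true` and `y_hat` are illustrative, not model output from this notebook):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: average squared residual.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

Here the squared residuals sum to 1.5, giving an MSE of 0.375, and the total sum of squares is 20, giving an R^2 of 0.925; the sklearn functions return the same values.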

### Polynomial Regression

In [24]:
#Applying PolynomialFeatures on X
p_transform = PolynomialFeatures(degree = 2) #Taking Degree as 2
x_transformed = p_transform.fit_transform(X)
x_transformed

array([[ 1.        ,  2.04177671, -1.78587489, ...,  3.09917355,
         3.19446331,  3.29268293],
       [ 1.        ,  2.04177671, -1.78587489, ...,  3.09917355,
         3.19446331,  3.29268293],
       [ 1.        ,  2.04177671, -1.78587489, ...,  3.09917355,
         3.19446331,  3.29268293],
       ...,
       [ 1.        , -1.36381225,  1.55394308, ...,  1.5476584 ,
         1.75577169,  1.99186992],
       [ 1.        , -1.36381225,  1.55394308, ...,  1.5476584 ,
         1.75577169,  1.99186992],
       [ 1.        , -1.36381225,  1.55394308, ...,  1.5476584 ,
         1.75577169,  1.99186992]])

In [25]:
x_transformed.shape

(768, 45)
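
The 45 columns follow from counting every monomial of total degree at most 2 in 8 variables: 1 bias term, 8 linear terms, 8 squares and 28 pairwise products. In general the count is C(n + d, d) for n features and degree d. A short check of that formula (the all-zeros input array is only a placeholder used to read off the output width):

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 8, 2

# Number of monomials of total degree <= `degree` in `n_features` variables.
expected = comb(n_features + degree, degree)

# Breakdown for degree 2: bias + linear + squares + pairwise interactions.
breakdown = 1 + n_features + n_features + comb(n_features, 2)

# Placeholder input; only the number of output columns matters here.
poly = PolynomialFeatures(degree=degree)
n_out = poly.fit_transform(np.zeros((1, n_features))).shape[1]

print(expected, breakdown, n_out)
```

The count grows quickly with the degree (165 columns at degree 3 for the same 8 features), which is one reason to keep the degree low on a dataset of 768 rows.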

In [26]:
x_transformed = pd.DataFrame(x_transformed,columns = p_transform.get_feature_names_out(input_features=X.columns))
x_transformed

Unnamed: 0,1,X1,X2,X3,X4,X5,X6,X7,X8,X1^2,...,X5^2,X5 X6,X5 X7,X5 X8,X6^2,X6 X7,X6 X8,X7^2,X7 X8,X8^2
0,1.0,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.341641,-1.760447,-1.814575,4.168852,...,1.0,-1.341641,-1.760447,-1.814575,1.8,2.361887,2.434508,3.099174,3.194463,3.292683
1,1.0,2.041777,-1.785875,-0.561951,-1.470077,1.0,-0.447214,-1.760447,-1.814575,4.168852,...,1.0,-0.447214,-1.760447,-1.814575,0.2,0.787296,0.811503,3.099174,3.194463,3.292683
2,1.0,2.041777,-1.785875,-0.561951,-1.470077,1.0,0.447214,-1.760447,-1.814575,4.168852,...,1.0,0.447214,-1.760447,-1.814575,0.2,-0.787296,-0.811503,3.099174,3.194463,3.292683
3,1.0,2.041777,-1.785875,-0.561951,-1.470077,1.0,1.341641,-1.760447,-1.814575,4.168852,...,1.0,1.341641,-1.760447,-1.814575,1.8,-2.361887,-2.434508,3.099174,3.194463,3.292683
4,1.0,1.284979,-1.229239,0.000000,-1.198678,1.0,-1.341641,-1.760447,-1.814575,1.651171,...,1.0,-1.341641,-1.760447,-1.814575,1.8,2.361887,2.434508,3.099174,3.194463,3.292683
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,1.0,-1.174613,1.275625,0.561951,0.972512,-1.0,1.341641,1.244049,1.411336,1.379715,...,1.0,-1.341641,-1.244049,-1.411336,1.8,1.669067,1.893506,1.547658,1.755772,1.991870
764,1.0,-1.363812,1.553943,1.123903,0.972512,-1.0,-1.341641,1.244049,1.411336,1.859984,...,1.0,1.341641,-1.244049,-1.411336,1.8,-1.669067,-1.893506,1.547658,1.755772,1.991870
765,1.0,-1.363812,1.553943,1.123903,0.972512,-1.0,-0.447214,1.244049,1.411336,1.859984,...,1.0,0.447214,-1.244049,-1.411336,0.2,-0.556356,-0.631169,1.547658,1.755772,1.991870
766,1.0,-1.363812,1.553943,1.123903,0.972512,-1.0,0.447214,1.244049,1.411336,1.859984,...,1.0,-0.447214,-1.244049,-1.411336,0.2,0.556356,0.631169,1.547658,1.755772,1.991870


In [27]:
x_transformed.columns

Index(['1', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X1^2', 'X1 X2',
       'X1 X3', 'X1 X4', 'X1 X5', 'X1 X6', 'X1 X7', 'X1 X8', 'X2^2', 'X2 X3',
       'X2 X4', 'X2 X5', 'X2 X6', 'X2 X7', 'X2 X8', 'X3^2', 'X3 X4', 'X3 X5',
       'X3 X6', 'X3 X7', 'X3 X8', 'X4^2', 'X4 X5', 'X4 X6', 'X4 X7', 'X4 X8',
       'X5^2', 'X5 X6', 'X5 X7', 'X5 X8', 'X6^2', 'X6 X7', 'X6 X8', 'X7^2',
       'X7 X8', 'X8^2'],
      dtype='object')

In [28]:
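#Splitting the polynomial features into training and test sets (80:20)
#This is a new random split, so the test rows differ from those used for the linear model
xtrain, xtest, ytrain, ytest = train_test_split(x_transformed, y, test_size=0.2)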

In [29]:
xtrain.shape, ytrain.shape

((614, 45), (614,))

In [30]:
xtest.shape, ytest.shape

((154, 45), (154,))

In [31]:
#Training the model
linReg = LinearRegression()
linReg.fit(xtrain, ytrain)

In [32]:
# Predicting the heating load for the test data
ypred = linReg.predict(xtest)

In [33]:
# R^2 score on the test set (closer to 1 is better)
r2_score(y_pred=ypred,y_true=ytest)

0.9939982382171049

In [34]:
mean_squared_error(y_pred=ypred,y_true=ytest)

0.5949574610722527

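In this run, moving from multilinear regression to degree-2 polynomial regression raises the R^2 score from about 0.928 to about 0.994 and lowers the MSE from about 8.11 to about 0.59.
The two models were scored on different random test splits (no random_state was fixed), so the figures are not strictly like-for-like, but the gap is large enough to suggest that polynomial regression is the better fit for this problem.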
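
One way to make the comparison like-for-like is to score both models on identical folds with cross-validation, wrapping the scaling and polynomial expansion in a pipeline so they are refitted inside each fold. The sketch below does this on synthetic data with a known quadratic signal; `X_syn`, `y_syn`, the seed and the 5-fold setup are assumptions for illustration, and the scores it prints say nothing about the Energy dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with linear, squared and interaction terms plus small noise.
rng = np.random.default_rng(42)
X_syn = rng.uniform(-2, 2, size=(300, 3))
y_syn = (
    1.5 * X_syn[:, 0]
    + X_syn[:, 1] ** 2
    + X_syn[:, 0] * X_syn[:, 2]
    + rng.normal(scale=0.1, size=300)
)

# Each pipeline refits its preprocessing steps on the training part of every fold.
linear_model = make_pipeline(StandardScaler(), LinearRegression())
poly_model = make_pipeline(
    StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
)

# A fixed random_state makes both models see exactly the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
r2_linear = cross_val_score(linear_model, X_syn, y_syn, cv=cv, scoring="r2")
r2_poly = cross_val_score(poly_model, X_syn, y_syn, cv=cv, scoring="r2")

print("linear mean R^2:", r2_linear.mean())
print("poly   mean R^2:", r2_poly.mean())
```

Applied to `X` and `y` from this notebook (before scaling), the same two pipelines would give per-fold scores whose mean and spread can be compared directly.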