
In [35]:
# import libraries needed

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [36]:
# let's meet the data. 
# Algorithm: Identify the format in which the dataset is provided, import the dataset using the appropriate import function 
# and catch a glimpse of the data by viewing the first few rows

# loading in the dataset
df = pd.read_excel("https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx")

# Previewing the data; first 5 rows
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.9,563.5,318.5,122.5,7.0,2,0.0,0,20.84,28.28


In [37]:
# let's rename the columns. 
# Algorithm: first, I'll create a dictionary with keys as the old column names and values as the new column names
# then, I'll pass this dictionary to the `.rename()` method as the value of its `columns` argument.

# creating the dictionary
column_names = {
    "X1": "Relative_Compactness", 
    "X2": "Surface_Area", 
    "X3": "Wall_Area", 
    "X4": "Roof_Area", 
    "X5": "Overall_Height", 
    "X6": "Orientation", 
    "X7": "Glazing_Area", 
    "X8": "Glazing_Area_Distribution", 
    "Y1": "Heating_Load", 
    "Y2": "Cooling_Load"
    }

# feeding the dictionary to the .rename() method
energy_data = df.rename(columns = column_names)

energy_data.head()


Unnamed: 0,Relative_Compactness,Surface_Area,Wall_Area,Roof_Area,Overall_Height,Orientation,Glazing_Area,Glazing_Area_Distribution,Heating_Load,Cooling_Load
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.9,563.5,318.5,122.5,7.0,2,0.0,0,20.84,28.28


In [38]:
# Let's create a multiple linear regression model. Unlike simple linear regression,
# it uses more than one independent variable to predict the dependent variable.
# First, we normalize the dataset so that every column (all are already numeric) lies in the range 0 to 1.
# This puts the features on a common scale, which makes the fitted coefficients easier to compare;
# for gradient-based learners it also tends to speed up training.

# normalizing the dataset to a common scale using the MinMaxScaler

normalized_energy_data = pd.DataFrame(MinMaxScaler().fit_transform(energy_data), columns = energy_data.columns)

# let's drop the columns that can be used as target variables from the normalized data, leaving only our predictors
predictor_features = normalized_energy_data.drop(columns= ["Heating_Load", "Cooling_Load"])

# We want to focus on predicting the Heating_Load for now so, let's isolate `Heating_Load` as our target/response variable

heating_target = normalized_energy_data["Heating_Load"]
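
To make the scaling step concrete, here is a minimal sketch of the arithmetic that min-max scaling applies to each column: subtract the column minimum, then divide by the column range. The four `surface_area` values are illustrative only and are not the full dataset.

```python
import numpy as np

# Toy column of values (illustrative only, not the full dataset)
surface_area = np.array([514.5, 563.5, 588.0, 612.5])

# Min-max scaling: subtract the column minimum, divide by the column range
col_min, col_max = surface_area.min(), surface_area.max()
scaled = (surface_area - col_min) / (col_max - col_min)
print(scaled)  # smallest value maps to 0, largest to 1

# The transformation is reversible, so scaled values can be mapped back
restored = scaled * (col_max - col_min) + col_min
print(restored)
```

Because the same formula is applied to `Heating_Load`, the predictions and error metrics later in the notebook are on the 0 to 1 scale rather than in the original units.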

In [39]:
# let's now split our dataset into training and testing subsets.

x_train, x_test, y_train, y_test = train_test_split(predictor_features, heating_target, test_size = 0.3, random_state = 1)
#I am using 70% of the dataset to train the model and 30% to test.
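
One caveat with the cells above: the scaler was fitted on the whole dataset before the split, so the test rows influenced the minimum and maximum used for scaling. A common alternative, sketched here on synthetic data (the arrays `X` and `y` are made up for illustration and are not the energy dataset), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: 20 rows, 3 features)
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(20, 3))
y = rng.uniform(0, 50, size=20)

# Split first, so the test rows stay unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Learn the min and max from the training rows only
scaler = MinMaxScaler().fit(X_train)

# Apply the same training-derived scaling to both subsets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

With this approach the scaled test values may fall slightly outside 0 to 1, which is expected: the test set is being treated as genuinely new data.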

In [40]:
# Now, let's build the model using the linear regression algorithm
model = LinearRegression()

# fitting the model to the training dataset

model.fit(x_train, y_train)

# training completed. Now let's throw in our test data to test our model and obtain predicted values

predicted_values = model.predict(x_test)
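
As a rough illustration of what the fitted model does when it predicts, the sketch below rebuilds the predictions by hand as intercept plus the weighted sum of the features. The toy `X` and `y` are assumptions chosen so that the true relationship is exactly linear; they are not the energy dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with two features and an exactly linear target:
# y = 1 + 2 * first feature + 3 * second feature
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# One coefficient per feature, plus a single intercept
print(model.coef_, model.intercept_)

# A prediction is the intercept plus the dot product of features and coefficients
manual_predictions = model.intercept_ + X @ model.coef_
print(np.allclose(manual_predictions, model.predict(X)))
```

The same attributes, `coef_` and `intercept_`, are available on the `model` fitted in this notebook and show how much each scaled feature contributes to the predicted heating load.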

In [41]:
# Now, let's test the performance of our multiple linear regression model. I will try a few evaluation metrics here.

# Let's begin with the MAE (mean absolute error)

from sklearn.metrics import mean_absolute_error

mae = round((mean_absolute_error(y_test, predicted_values)), 3)

mae

0.063

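The lower the MAE, the more accurate the model. Because the target was min-max scaled, 0.063 is measured on the 0 to 1 scale, i.e. an average error of about 6.3% of the Heating_Load range.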
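
For intuition, the MAE can be computed by hand as the average of the absolute differences between actual and predicted values. The arrays below are toy numbers, and `y_min` and `y_max` are a hypothetical target range used only to show how a scaled MAE converts back to original units.

```python
import numpy as np

# Toy actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 8.0, 10.0])
y_pred = np.array([2.0, 5.0, 10.0, 9.0])

# MAE: mean of the absolute errors (1, 0, 2, 1)
mae_by_hand = np.mean(np.abs(y_true - y_pred))
print(mae_by_hand)

# An MAE computed on a min-max scaled target is a fraction of the target's range,
# so multiplying by the range expresses it in the original units.
y_min, y_max = 10.0, 40.0   # hypothetical range, not taken from the dataset
mae_scaled = 0.063          # the value reported above
mae_original_units = mae_scaled * (y_max - y_min)
print(round(mae_original_units, 3))
```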

In [42]:
# let's try the RMSE- Root Mean Squared Error

from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, predicted_values))

rmse.round(3) # or round(rmse, 3)

0.089

Like the MAE, the lower the RMSE, the better the model.
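
The sketch below computes the RMSE by hand on toy numbers (not the notebook's data) and compares it with the MAE on the same values. Squaring the errors before averaging makes the RMSE weigh large errors more heavily, so it is never smaller than the MAE.

```python
import numpy as np

# Toy actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 8.0, 10.0])
y_pred = np.array([2.0, 5.0, 10.0, 9.0])

errors = y_true - y_pred

# RMSE: square the errors, average them, then take the square root
rmse_by_hand = np.sqrt(np.mean(errors ** 2))

# MAE on the same errors, for comparison
mae_by_hand = np.mean(np.abs(errors))

print(round(rmse_by_hand, 3), mae_by_hand)
```

Here the single error of 2 pulls the RMSE above the MAE; a large gap between the two metrics is a hint that a few predictions are far off.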

In [43]:
# let's try the R_squared

from sklearn.metrics import r2_score

# stored as `r2` so that the imported r2_score function is not overwritten; expressed as a percentage
r2 = round((r2_score(y_test, predicted_values) * 100), 3)

r2

89.348

For the R2 score, the higher the better. Here the model explains about 89% of the variance in Heating_Load on the test set.
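
As a minimal sketch of where the R2 score comes from, the toy example below (illustrative numbers, not the notebook's data) compares the model's squared errors with those of a baseline that always predicts the mean of the actual values.

```python
import numpy as np

# Toy actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 8.0, 10.0])
y_pred = np.array([2.0, 5.0, 10.0, 9.0])

# Residual sum of squares: squared error left over after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R2 is the share of the total variation that the model accounts for
r2_by_hand = 1 - ss_res / ss_tot
print(round(r2_by_hand * 100, 3))
```

A value of 100 would mean perfect predictions, 0 would mean no better than predicting the mean, and negative values are possible when the model does worse than that baseline.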

Note: I have a good foundational understanding of the machine learning (ML) domain: the types of ML and when to use them, how businesses can use ML to their advantage, and the considerations in choosing an ML algorithm. However, this is my first time getting my hands dirty with ML, so pardon me if I have not factored in concepts like overfitting and underfitting, or the assumptions of the linear regression model (such as the absence of multicollinearity, homoscedasticity, and independence of the errors), as I am still learning how to check for them. I hope to examine them in subsequent ML projects.

