<a href="https://colab.research.google.com/github/ShivSubedi/TraditionalML_PredictiveAnalytics_basics/blob/main/Linear_Regression/Multiple_Linear_Regression/Check_RandomVariableEffect/Multiple_linear_regression_incRandomVariable.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Multiple Regression Model (Population Model)
We build a multiple regression model on top of the 'simple regression model'.
The data contains the 'SAT' and 'GPA' columns used in the simple linear regression model. To explore the multiple regression case, the data also includes a new independent variable, 'Rand 1,2,3', which assigns each student a random integer from 1 to 3.

**This is a test to see whether adding a new random variable such as 'Rand 1,2,3' improves the prediction of GPA or makes it worse.**

At the end, we compare the model summaries and see how the regression output flags a random variable like 'Rand 1,2,3' for exclusion, since it has no relevance in predicting the 'GPA' score.

## Step 1: Import relevant libraries

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
sns.set()

## Step 2: Load the Data
The data is stored on GitHub and is downloaded from its raw GitHub URL.


In [38]:
# Download the file from GitHub
github_Rawpath_to_multipleLinearReg_data = "https://raw.githubusercontent.com/ShivSubedi/TraditionalML_PredictiveAnalytics_basics/refs/heads/main/Linear_Regression/Multiple_Linear_Regression/Check_RandomVariableEffect/Multiple_linear_regression_incRandVar.csv"
!wget -O Multiple_linear_regression.csv {github_Rawpath_to_multipleLinearReg_data} #download the csv file

--2025-04-04 19:17:53--  https://raw.githubusercontent.com/ShivSubedi/TraditionalML_PredictiveAnalytics_basics/refs/heads/main/Linear_Regression/Multiple_Linear_Regression/Check_RandomVariableEffect/Multiple_linear_regression_incRandVar.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1114 (1.1K) [text/plain]
Saving to: ‘Multiple_linear_regression.csv’


2025-04-04 19:17:53 (43.8 MB/s) - ‘Multiple_linear_regression.csv’ saved [1114/1114]



In [7]:
#load the data. Data is saved in the csv format
data_csv_mlr= pd.read_csv('Multiple_linear_regression.csv')

In [36]:
#explore the data type and make sure that data exists
data_csv_mlr # gives overview of head and tail section of data
# data_csv_mlr.head(10) #only returns head, can optionally specify #entries for head
# data_csv_mlr.tail()
# data_csv_mlr.dtypes #returns data types of each column
# len(data_csv_mlr) #returns the total number of entries (rows) in the table
# data_csv_mlr['GPA'] #returns a pandas Series, i.e. a labeled 1D array
# data_csv_mlr[['GPA', 'SAT']] #new Pandas DataFrame containing only the specified columns.


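| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| 0 | 1714 | 2.40 | 1 |
| 1 | 1664 | 2.52 | 3 |
| 2 | 1760 | 2.54 | 3 |
| 3 | 1685 | 2.74 | 3 |
| 4 | 1693 | 2.83 | 2 |
| ... | ... | ... | ... |
| 79 | 1936 | 3.71 | 3 |
| 80 | 1810 | 3.71 | 1 |
| 81 | 1987 | 3.73 | 3 |
| 82 | 1962 | 3.76 | 1 |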


## Step 3: Descriptive Statistics

In [19]:
data_csv_mlr.describe()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| count | 84.0 | 84.0 | 84.0 |
| mean | 1845.27381 | 3.330238 | 2.059524 |
| std | 104.530661 | 0.271617 | 0.855192 |
| min | 1634.0 | 2.4 | 1.0 |
| 25% | 1772.0 | 3.19 | 1.0 |
| 50% | 1846.0 | 3.38 | 2.0 |
| 75% | 1934.0 | 3.5025 | 3.0 |
| max | 2050.0 | 3.81 | 3.0 |


## Step 4: Build the Multiple Regression model

### 4(a) Define the dependent and independent variables

In [27]:
x = data_csv_mlr[['SAT', 'Rand 1,2,3']] #defining the independent variables
y = data_csv_mlr['GPA'] #defining the dependent variable

In [28]:
x1=sm.add_constant(x) #add a column of constant '1', which would be the x value for the constant term, i.e. x^0=1

### 4(b) Perform Regression Analysis
Here, we perform the following steps:
1. Prepare the independent variable data (X) for linear regression by adding a constant term.

A multiple regression model with two independent variables is written as the linear equation: y = b<sub>0</sub>x<sup>0</sup> + b<sub>1</sub>x<sub>1</sub> + b<sub>2</sub>x<sub>2</sub>, where x<sup>0</sup> = 1, x<sub>1</sub> is 'SAT', and x<sub>2</sub> is 'Rand 1,2,3'.

When performing linear regression in Python using libraries like statsmodels or scikit-learn, calculations are often done using matrix operations. To accommodate the constant (intercept) term b<sub>0</sub> in matrix form, we need to represent it as a separate column of 1s. This column of 1s effectively multiplies the intercept b<sub>0</sub> in the matrix calculation, ensuring that b<sub>0</sub> is included in the model.

2. Fit a linear regression model using the **OLS (Ordinary Least Squares)** method, with the dependent variable 'y' and the new independent variable matrix 'x1'.

OLS minimizes the sum of the squared differences between the observed values and the values predicted by the model. In simpler terms, OLS finds the line (or, with two independent variables, the plane) that minimizes the total squared error of the predictions, i.e. the smallest SSE.
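The two steps above can be sketched with plain NumPy to show what the column of 1s does in the matrix calculation. This is only an illustration: the six data points are made-up stand-ins (not the notebook's CSV), and `np.linalg.lstsq` is used here in place of `sm.OLS` to keep the sketch dependency-free.

```python
import numpy as np

# Made-up stand-in data (an assumption, NOT the notebook's dataset)
sat = np.array([1700.0, 1750.0, 1800.0, 1850.0, 1900.0, 1950.0])
rand = np.array([1.0, 3.0, 2.0, 3.0, 1.0, 2.0])
gpa = np.array([2.9, 3.1, 3.2, 3.4, 3.5, 3.6])

# Design matrix: a column of 1s for the intercept b0, then x1 (SAT) and x2 (Rand)
X = np.column_stack([np.ones_like(sat), sat, rand])

# Least-squares solution: the b that minimizes SSE = sum((y - X @ b)**2)
b, *_ = np.linalg.lstsq(X, gpa, rcond=None)

residuals = gpa - X @ b
sse = float(residuals @ residuals)

print("b0, b1, b2 =", b)
print("SSE =", sse)
```

Any other choice of coefficients, for example shifting the intercept slightly, gives a larger SSE on the same data, which is what "least squares" means.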

In [35]:
result_OLS=sm.OLS(y,x1).fit() #perform the Ordinary Least Squares fit.
coeff=result_OLS.params
print('Coeff of const term (bo)', coeff.iloc[0])
print('Coeff of SAT score (b1)', coeff.iloc[1])
print('Coeff of Rand 1,2,3 (b2)', coeff.iloc[2])

Coeff of const term (bo) 0.29603261264909797
Coeff of SAT score (b1) 0.0016535418013456677
Coeff of Rand 1,2,3 (b2) -0.008269822523160442


In [30]:
result_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | GPA | R-squared: | 0.407 |
| Model: | OLS | Adj. R-squared: | 0.392 |
| Method: | Least Squares | F-statistic: | 27.76 |
| Date: | Fri, 04 Apr 2025 | Prob (F-statistic): | 6.58e-10 |
| Time: | 16:35:48 | Log-Likelihood: | 12.72 |
| No. Observations: | 84 | AIC: | -19.44 |
| Df Residuals: | 81 | BIC: | -12.15 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2960 | 0.417 | 0.710 | 0.480 | -0.533 | 1.125 |
| SAT | 0.0017 | 0.000 | 7.432 | 0.000 | 0.001 | 0.002 |
| Rand 1,2,3 | -0.0083 | 0.027 | -0.304 | 0.762 | -0.062 | 0.046 |

| | | | |
|---|---|---|---|
| Omnibus: | 12.992 | Durbin-Watson: | 0.948 |
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 16.364 |
| Skew: | -0.731 | Prob(JB): | 0.00028 |
| Kurtosis: | 4.594 | Cond. No. | 33300.0 |


## Step 5: Comparison with the 'Simple Regression Model'
We conclude below that adding the random variable 'Rand 1,2,3':
- lowers the adjusted explanatory power of the model (Adj. R-squared), as described in 5(a)
- is statistically insignificant, as described in 5(b)

Hence, this variable should be dropped from the model, because the bias introduced by this variable most likely affects the coefficients of the other terms (*b<sub>0</sub>* and *b<sub>1</sub>*) as well.

After dropping the variable, we should fit a new model without it to avoid this bias.
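As a rough sketch of this refit, the snippet below fits one model with the random variable and one without it, then compares the coefficients and R-squared. The data are synthetic and seeded (an assumption standing in for the CSV), and `fit_ols` is a hypothetical helper written for this illustration, not a statsmodels function.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in data (an assumption, NOT the notebook's CSV)
n = 84
sat = rng.uniform(1600, 2050, size=n)
rand = rng.integers(1, 4, size=n).astype(float)  # random integers 1, 2 or 3
gpa = 0.3 + 0.0017 * sat + rng.normal(0, 0.2, size=n)

def fit_ols(columns, y):
    """Hypothetical helper: OLS with an intercept, returns (coefficients, R-squared)."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)       # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)    # total variation
    return b, 1 - sse / sst

b_full, r2_full = fit_ols([sat, rand], gpa)   # SAT + random variable
b_reduced, r2_reduced = fit_ols([sat], gpa)   # SAT only

print("with Rand:   ", b_full, r2_full)
print("without Rand:", b_reduced, r2_reduced)
```

In the notebook's own terms, the reduced model would correspond to passing only the 'SAT' column through `sm.add_constant` and `sm.OLS`.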

### 5(a) Comparing R-squared and Adj. R-squared
From the 'Simple Regression Model (LRM)' we observed the following:
- R-squared= 0.406
- Adj. R-squared = 0.399

From the 'Multiple Regression Model (MRM)' above we observed the following:
- R-squared= 0.407
- Adj. R-squared = 0.392

Comparison:
- R-squared: the value rises slightly (0.406 to 0.407), which on its own might suggest that the MRM has more explanatory power than the LRM. However, R-squared never decreases when a variable is added, so this small increase is not evidence of a better model.
- Adj. R-squared: the value in the MRM is lower than in the LRM (0.392 < 0.399), which suggests that our model is penalized for including a variable such as 'Rand 1,2,3' that has no relevance or explanatory power for predicting GPA. We added more information to the model but lost value.
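The penalty can be checked by hand with the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below assumes the simple model was fit on the same 84 observations; `adjusted_r2` is a name chosen for this illustration.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: shrinks R-squared for every predictor added."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

n = 84  # 'No. Observations' from the summary table (assumed equal for both models)

adj_simple = adjusted_r2(0.406, n, 1)    # LRM: SAT only
adj_multiple = adjusted_r2(0.407, n, 2)  # MRM: SAT + 'Rand 1,2,3'

print(round(adj_simple, 3))    # 0.399
print(round(adj_multiple, 3))  # 0.392
```

The tiny gain in R-squared (0.001) is smaller than the penalty for the extra predictor, so the adjusted value drops.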

### 5(b) Can the model itself flag an irrelevant variable?
Yes. In the coefficients table, the p-value (P>|t|) for 'Rand 1,2,3' is 0.762, far above 0.05. We therefore cannot reject the null hypothesis (H<sub>0</sub>: *b<sub>2</sub>* = 0) at the 5% significance level or at any other conventional level; it could only be rejected at a significance level of 76.2% or higher.
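The same conclusion can be read from the other columns of the coefficient row. The sketch below recomputes the t-statistic from the rounded values in the table and checks whether the reported 95% confidence interval contains zero. The critical value 1.99 is an approximation for a two-sided 5% test with 81 residual degrees of freedom; it is supplied here as an assumption and does not come from the notebook.

```python
# Values copied from the 'Rand 1,2,3' row of the summary table (rounded)
coef, std_err = -0.0083, 0.027
ci_low, ci_high = -0.062, 0.046   # reported 95% confidence interval

t_stat = coef / std_err           # close to the reported -0.304 (inputs are rounded)
t_crit = 1.99                     # approximate two-sided 5% critical value, 81 df

reject_h0 = abs(t_stat) > t_crit
ci_contains_zero = ci_low <= 0 <= ci_high

print("t-statistic:", round(t_stat, 3))
print("reject H0 at 5%:", reject_h0)
print("95% CI contains zero:", ci_contains_zero)
```

A t-statistic well inside the critical range and a confidence interval that straddles zero both say the same thing as the large p-value: the data give no evidence that the coefficient differs from zero.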

### 5(c) Comparing the F-statistic
From the 'Simple Regression Model (LRM)' we observed the following:
- F-statistic= 56.05
- prob(F-statistic) = 7.20e-11

From the 'Multiple Regression Model (MRM)' above we observed the following:
- F-statistic= 27.76
- prob(F-statistic) = 6.58e-10


Comparison:
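**Note: The lower the F-statistic, the closer the model is to being non-significant.**

The F-statistic of the MRM is lower than that of the LRM (27.76 versus 56.05), and its p-value is higher (6.58e-10 versus 7.20e-11). Both models remain significant overall, but adding 'Rand 1,2,3' lowered the overall significance of the model.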
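The drop in the F-statistic can be approximated from the R-squared values using the standard relation F = (R²/p) / ((1 - R²)/(n - p - 1)). The sketch below assumes both models use the same 84 observations and works from the rounded R-squared values, so the results only approximately match the reported 56.05 and 27.76. `f_statistic` is a name chosen for this illustration.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of a regression, computed from its R-squared."""
    explained_per_predictor = r2 / n_predictors
    unexplained_per_df = (1 - r2) / (n_obs - n_predictors - 1)
    return explained_per_predictor / unexplained_per_df

n = 84  # 'No. Observations' (assumed equal for both models)

f_simple = f_statistic(0.406, n, 1)    # LRM: SAT only
f_multiple = f_statistic(0.407, n, 2)  # MRM: SAT + 'Rand 1,2,3'

print("LRM F-statistic (approx.):", round(f_simple, 2))
print("MRM F-statistic (approx.):", round(f_multiple, 2))
```

Because R-squared barely changes while the number of predictors doubles, the explained variation is spread over two predictors instead of one, and the F-statistic roughly halves.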