# Multiple linear regression with scikit-learn
In this demonstration we predict a student's GPA from two features, 'SAT' (the SAT score) and 'Rand 1,2,3' (a noise column), and check whether each feature is useful for the model (feature selection).

**We do not split the dataset into training and test sets, because the only objective is to demonstrate feature selection with p-values.**

In [1]:
# Import relevant libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import f_regression
from sklearn import set_config as setcon
setcon(print_changed_only=False)
sns.set()

## Load the data

In [2]:
data=pd.read_csv(r'datasets/2.Multiple_linear_regression.csv')
data.head()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| 0 | 1714 | 2.4 | 1 |
| 1 | 1664 | 2.52 | 3 |
| 2 | 1760 | 2.54 | 3 |
| 3 | 1685 | 2.74 | 3 |
| 4 | 1693 | 2.83 | 2 |


In [3]:
# Check summary of our dataset
data.describe()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| count | 84.0 | 84.0 | 84.0 |
| mean | 1845.27381 | 3.330238 | 2.059524 |
| std | 104.530661 | 0.271617 | 0.855192 |
| min | 1634.0 | 2.4 | 1.0 |
| 25% | 1772.0 | 3.19 | 1.0 |
| 50% | 1846.0 | 3.38 | 2.0 |
| 75% | 1934.0 | 3.5025 | 3.0 |
| max | 2050.0 | 3.81 | 3.0 |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   SAT         84 non-null     int64  
 1   GPA         84 non-null     float64
 2   Rand 1,2,3  84 non-null     int64  
dtypes: float64(1), int64(2)
memory usage: 2.1 KB


In [5]:
# Check null values
data.isnull().sum()

SAT           0
GPA           0
Rand 1,2,3    0
dtype: int64

In [6]:
# Check shape of our data
data.shape

(84, 3)

## Create the multiple linear regression
### Declare the dependent and independent variables

In [7]:
# use both 'SAT' and 'Rand 1,2,3' for model training
x=data[['SAT','Rand 1,2,3']]
y=data['GPA']

## Regression itself

In [8]:
reg=LinearRegression()
reg.fit(x,y)

In [9]:
# Check the coefficients of both features in the fitted model
reg.coef_

array([ 0.00165354, -0.00826982])

In [10]:
# Check intercept
reg.intercept_

0.29603261264909486
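The coefficients and intercept above define the fitted equation GPA = intercept + b1 * SAT + b2 * 'Rand 1,2,3'. As a minimal sketch, the equation can be evaluated by hand; the constants below are copied (rounded) from the outputs above, and the helper name `predict_gpa` is hypothetical, not part of the notebook.

```python
# Intercept and coefficients copied (rounded) from reg.intercept_ and reg.coef_ above.
INTERCEPT = 0.29603261
COEF_SAT = 0.00165354
COEF_RAND = -0.00826982


def predict_gpa(sat, rand):
    """Evaluate the fitted equation GPA = b0 + b1*SAT + b2*Rand by hand."""
    return INTERCEPT + COEF_SAT * sat + COEF_RAND * rand


# First row of the dataset: SAT=1714, 'Rand 1,2,3'=1 (the observed GPA is 2.4).
first_row_prediction = predict_gpa(1714, 1)
print(round(first_row_prediction, 3))

# Moving the noise feature from 1 to 3 changes the prediction by only 2 * COEF_RAND.
noise_effect = predict_gpa(1714, 3) - predict_gpa(1714, 1)
print(round(noise_effect, 4))
```

With the fitted `reg` object, `reg.predict` on a row with the same two columns should give the same value up to rounding of the copied constants.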

## Calculating the $R^{2}$

In [11]:
# Compute R-squared of the model on the data
R_square=reg.score(x,y)
R_square

## Calculating Adjusted $R^{2}$
$R_{adj}^{2}=1-(1-R^{2})\frac{n-1}{n-p-1}$

where $n$ is the number of observations and $p$ is the number of predictors.

In [12]:
x.shape

(84, 2)

In [13]:
n=x.shape[0]
p=x.shape[1]
R_adj=1-(1-R_square)*((n-1)/(n-p-1))

In [14]:
R_adj

0.39203134825134023
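The adjusted $R^{2}$ calculation can be wrapped in a small reusable function. This is a minimal sketch: the name `adjusted_r2` and the illustrative inputs are assumptions, not values taken from the notebook.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted on n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Illustrative values only: the same R^2 is penalised more as predictors are added.
example_two_predictors = adjusted_r2(0.5, n=84, p=2)
example_ten_predictors = adjusted_r2(0.5, n=84, p=10)
print(example_two_predictors, example_ten_predictors)
```

In the notebook this would be called as `adjusted_r2(R_square, n, p)`. Because the penalty grows with `p`, adding a feature that barely raises $R^{2}$ lowers the adjusted value.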

## Feature Selection

In [15]:
f_regression(x,y)

(array([56.04804786,  0.17558437]), array([7.19951844e-11, 6.76291372e-01]))

In [16]:
# f_regression returns (F-statistics, p-values); unpack once and round
f_values,p_values=f_regression(x,y)
f_values=f_values.round(3)
p_values=p_values.round(3)

In [17]:
p_values

array([0.   , 0.676])
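`f_regression` scores each feature on its own: as far as its documentation describes, it computes the correlation between one feature and the target and converts it to an F-statistic with 1 and n-2 degrees of freedom. The sketch below reproduces that conversion in plain Python on made-up toy data; the function name `univariate_f` is hypothetical, and the p-value step (the upper tail of the F distribution) is left out because it needs a statistics library.

```python
import math


def univariate_f(x, y):
    """F-statistic of a one-feature linear regression, from the Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    return r ** 2 / (1 - r ** 2) * (n - 2)  # F with (1, n-2) degrees of freedom


# Toy data (not from the notebook): correlation is 0.8 with n = 5.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 1, 4, 3, 5]
toy_f = univariate_f(toy_x, toy_y)
print(toy_f)
```

A strongly correlated feature such as 'SAT' gets a large F-statistic and a small p-value, while a feature that is nearly uncorrelated with the target gets an F-statistic close to zero.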

## Create a summary table

In [18]:
reg_summary=pd.DataFrame(data=x.columns.values,columns=['Features'])
reg_summary

| | Features |
|---|---|
| 0 | SAT |
| 1 | Rand 1,2,3 |


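### P-value
A p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. Here the null hypothesis is that a feature has no linear relationship with the target.

The lower the p-value, the stronger the evidence against the null hypothesis.

A p-value of 0.05 or lower is generally considered statistically significant.

Note that `f_regression` tests each feature separately with a univariate regression against the target, so these p-values are not the p-values of the coefficients in the multiple regression fitted above.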

In [19]:
reg_summary['Coefficient']=reg.coef_
reg_summary['P-values']=p_values
reg_summary

| | Features | Coefficient | P-values |
|---|---|---|---|
| 0 | SAT | 0.001654 | 0.0 |
| 1 | Rand 1,2,3 | -0.00827 | 0.676 |


### 'Rand 1,2,3' has a p-value of 0.676, far above 0.05, so it is not statistically significant and can be dropped from the model
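The conclusion follows from a simple threshold rule, sketched below. The p-values are copied from the `f_regression` output earlier in the notebook; the 0.05 cutoff is the conventional significance level, and the names `ALPHA`, `selected` and `dropped` are assumptions introduced for illustration.

```python
ALPHA = 0.05  # conventional significance level

# Feature names and unrounded p-values copied from the f_regression output.
features = ["SAT", "Rand 1,2,3"]
p_values = [7.19951844e-11, 6.76291372e-01]

# Keep a feature only if its p-value is at or below the threshold.
selected = [name for name, p in zip(features, p_values) if p <= ALPHA]
dropped = [name for name, p in zip(features, p_values) if p > ALPHA]

print("keep:", selected)
print("drop:", dropped)
```

A natural next step, not performed in this notebook, would be to refit the regression on the kept features only and compare the adjusted $R^{2}$ of the two models.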