# Multiple Linear Regression - Standardization 

In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression


### Load the Data

In [87]:
data = pd.read_csv('1.02.+Multiple+linear+regression.csv')
data.head()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
0,1714,1,2.4
1,1664,3,2.52
2,1760,3,2.54
3,1685,3,2.74
4,1693,2,2.83


In [88]:
data.describe()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
count,84.0,84.0,84.0
mean,1845.27381,2.059524,3.330238
std,104.530661,0.855192,0.271617
min,1634.0,1.0,2.4
25%,1772.0,1.0,3.19
50%,1846.0,2.0,3.38
75%,1934.0,3.0,3.5025
max,2050.0,3.0,3.81


In [89]:
data['SAT'].min()

1634

## Creating the multiple regression

### Declare the dependent and independent variables

In [90]:
x = data[['SAT','Rand 1,2,3']]  # input or feature
y = data['GPA']  # output or target

### Standardization

The ranges of the variables are completely different: SAT spans 1634-2050, while Rand 1,2,3 spans 1-3. To evaluate the real impact of each feature correctly, we need to standardize (normalize) the data and then fit a new linear regression on the standardized features.
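As a minimal sketch of what standardization does, the snippet below subtracts each column's mean and divides by its standard deviation, then compares the result with `StandardScaler`. The array `x_demo` is hypothetical toy data, not the notebook's CSV. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas `data.describe()` reports the sample standard deviation (`ddof=1`), so the two differ slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the notebook's CSV): two features on very different scales
x_demo = np.array([[1700.0, 1.0],
                   [1800.0, 3.0],
                   [1900.0, 2.0],
                   [2000.0, 3.0]])

# Manual standardization: subtract the column mean, divide by the column standard deviation.
# ddof=0 (population standard deviation) matches what StandardScaler computes.
z_manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0, ddof=0)

# Same operation through scikit-learn
z_sklearn = StandardScaler().fit_transform(x_demo)

print(np.allclose(z_manual, z_sklearn))   # do both approaches agree?
print(z_manual.mean(axis=0))              # each column is centered around 0
print(z_manual.std(axis=0))               # each column has unit standard deviation
```

After this transformation both columns are expressed in "standard deviations from the mean", which is what makes their regression weights comparable.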


In [91]:
from sklearn.preprocessing import StandardScaler


In [92]:
scaler = StandardScaler()    # object that will standardize the data

In [93]:
scaler.fit(x)  # compute the mean and standard deviation of each feature

StandardScaler()
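Fitting does not change the data; it only stores the per-feature statistics that `transform` will use later. A small sketch of how to inspect them, using hypothetical toy values rather than the notebook's data (`mean_` and `scale_` are attributes of a fitted `StandardScaler`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns play the role of SAT and Rand 1,2,3
x_toy = np.array([[1700.0, 1.0],
                  [1800.0, 3.0],
                  [1900.0, 2.0],
                  [2000.0, 3.0]])

scaler_toy = StandardScaler()
scaler_toy.fit(x_toy)

print(scaler_toy.mean_)    # per-feature mean learned during fit
print(scaler_toy.scale_)   # per-feature standard deviation (ddof=0) learned during fit
```

The same fitted scaler should be reused for any new data, so that new inputs are scaled with the training mean and standard deviation.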

###  Important! This is how to standardize all features: 

In [94]:
x_scaled = scaler.transform(x) 

In [95]:
x_scaled

array([[-1.26338288, -1.24637147],
       [-1.74458431,  1.10632974],
       [-0.82067757,  1.10632974],
       [-1.54247971,  1.10632974],
       [-1.46548748, -0.07002087],
       [-1.68684014, -1.24637147],
       [-0.78218146, -0.07002087],
       [-0.78218146, -1.24637147],
       [-0.51270866, -0.07002087],
       [ 0.04548499,  1.10632974],
       [-1.06127829,  1.10632974],
       [-0.67631715, -0.07002087],
       [-1.06127829, -1.24637147],
       [-1.28263094,  1.10632974],
       [-0.6955652 , -0.07002087],
       [ 0.25721362, -0.07002087],
       [-0.86879772,  1.10632974],
       [-1.64834403, -0.07002087],
       [-0.03150724,  1.10632974],
       [-0.57045283,  1.10632974],
       [-0.81105355,  1.10632974],
       [-1.18639066,  1.10632974],
       [-1.75420834,  1.10632974],
       [-1.52323165, -1.24637147],
       [ 1.23886453, -1.24637147],
       [-0.18549169, -1.24637147],
       [-0.5608288 , -1.24637147],
       [-0.23361183,  1.10632974],
       [ 1.68156984,

## Regression with scaled features


In [96]:
reg = LinearRegression()
reg.fit(x_scaled, y)

LinearRegression()

In [97]:
reg.coef_

array([ 0.17181389, -0.00703007])

In [98]:
reg.intercept_

3.330238095238095

In [99]:
reg.summary = pd.DataFrame([['Bias'], ['SAT'], ['Rand 1,2,3']], columns=['Features'])    # in ML, the intercept is called the bias
reg.summary['Weights'] = reg.intercept_, reg.coef_[0], reg.coef_[1]

In [100]:
reg.summary

Unnamed: 0,Features,Weights
0,Bias,3.330238
1,SAT,0.171814
2,"Rand 1,2,3",-0.00703


### We can see that the feature Rand 1,2,3 has practically no impact on GPA.
In this case we will drop it from the model. Standardization is also useful for selecting important features.
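To make the meaning of a standardized weight concrete: it is the change in predicted GPA per one standard deviation of the feature. The sketch below illustrates this on hypothetical synthetic data (the names `sat`, `rand_123`, `gpa` and the generating formula are assumptions for illustration, not the notebook's data). Dividing a standardized weight by the feature's standard deviation should recover the weight of a regression fitted on the raw features, and the intercept of the standardized model should equal the mean of the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the notebook's CSV
rng = np.random.default_rng(0)
sat = rng.uniform(1600, 2050, size=84)
rand_123 = rng.integers(1, 4, size=84).astype(float)   # pure noise feature
gpa = 0.0017 * sat + rng.normal(0, 0.2, size=84)

x_raw = np.column_stack([sat, rand_123])
demo_scaler = StandardScaler().fit(x_raw)

reg_scaled = LinearRegression().fit(demo_scaler.transform(x_raw), gpa)
reg_raw = LinearRegression().fit(x_raw, gpa)

# Standardized weight / feature standard deviation = weight in raw units
print(np.allclose(reg_scaled.coef_ / demo_scaler.scale_, reg_raw.coef_))

# With centered features, the intercept equals the mean of the target
print(np.isclose(reg_scaled.intercept_, gpa.mean()))

# Compare the magnitudes of the standardized weights directly
print(np.abs(reg_scaled.coef_))
```

This is consistent with the summary table above, where the bias (3.330238) matches the mean GPA reported by `data.describe()`.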

### Making predictions with the standardized coefficients (weights)


In [101]:
# Predicting the GPA of 2 students: (SAT 1700, Rand 2) and (SAT 1800, Rand 1)
new_data = pd.DataFrame(data=[[1700, 2], [1800, 1]], columns=['SAT', 'Rand 1,2,3'])


In [102]:
new_data

Unnamed: 0,SAT,"Rand 1,2,3"
0,1700,2
1,1800,1


In [103]:
reg.predict(new_data)    # strange results...We need to use scaled inputs!



array([295.39979563, 312.58821497])

###  Strange results...We need to use scaled inputs
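The odd values come from multiplying weights that were learned on standardized features by raw SAT scores. The arithmetic can be checked by hand; the sketch below copies the intercept, weights, and scaled inputs printed elsewhere in this notebook (rounded as displayed) and applies the linear formula directly. The variable names are illustrative only.

```python
import numpy as np

# Values copied from the notebook outputs (rounded as displayed)
intercept = 3.330238095238095
weights = np.array([0.17181389, -0.00703007])

raw_inputs = np.array([[1700, 2],
                       [1800, 1]])
scaled_inputs = np.array([[-1.39811928, -0.07002087],
                          [-0.43571643, -1.24637147]])

# prediction = intercept + w_SAT * SAT + w_Rand * Rand
pred_raw = intercept + raw_inputs @ weights        # raw SAT values inflate the result
pred_scaled = intercept + scaled_inputs @ weights  # inputs on the scale the model was trained on

print(pred_raw.round(2))
print(pred_scaled.round(2))
```

With raw inputs, the SAT term alone contributes roughly 0.17 × 1700 ≈ 292, which is why the prediction lands near 295 instead of a plausible GPA.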

In [104]:
new_data_scaled = scaler.transform(new_data)    # standardize the new inputs with the already fitted scaler
new_data_scaled

array([[-1.39811928, -0.07002087],
       [-0.43571643, -1.24637147]])

In [105]:
reg.predict(new_data_scaled)

array([3.09051403, 3.26413803])
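One way to avoid forgetting the scaling step at prediction time is to bundle the scaler and the regression into a single scikit-learn pipeline. This is an alternative to the manual steps used in this notebook, not what the notebook does; the sketch below uses hypothetical synthetic training data (`train`, `target`) in place of the CSV.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic training data standing in for the notebook's CSV
rng = np.random.default_rng(0)
train = pd.DataFrame({'SAT': rng.uniform(1600, 2050, size=84),
                      'Rand 1,2,3': rng.integers(1, 4, size=84)})
target = 0.0017 * train['SAT'] + rng.normal(0, 0.2, size=84)

# The pipeline scales the inputs and then applies the regression, in fit and in predict
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(train, target)

new_students = pd.DataFrame([[1700, 2], [1800, 1]], columns=['SAT', 'Rand 1,2,3'])
pipeline_pred = model.predict(new_students)   # raw inputs are scaled automatically

# Manual equivalent, following the same steps as the notebook
scaler_manual = StandardScaler().fit(train)
reg_manual = LinearRegression().fit(scaler_manual.transform(train), target)
manual_pred = reg_manual.predict(scaler_manual.transform(new_students))

print(np.allclose(pipeline_pred, manual_pred))
```

The design choice here is convenience and safety: the pipeline accepts raw SAT values directly, so the "strange results" situation cannot occur by accident.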

###  What if we remove the Random 1,2,3 variable?


In [107]:
reg_simple = LinearRegression()
x_simple_matrix = x_scaled[:, 0].reshape(-1,1)    # keep only SAT (drop Rand 1,2,3) and reshape into a 2D array with one column
reg_simple.fit(x_simple_matrix, y)

LinearRegression()

In [113]:
reg_simple.predict(new_data_scaled[:, 0].reshape(-1,1))

array([3.08970998, 3.25527879])

### Conclusion:
We already know from the previous analysis that the variable Rand 1,2,3 is not significant.

As we can see, the results with or without this feature are practically the same. When choosing variables for a multiple linear regression with feature scaling (standardization), p-value analysis is not strictly needed for making predictions: with standardized data it matters little whether a variable is significant, because an insignificant variable receives a weight close to zero and barely affects the results. The standardized weights reflect the actual effect of each feature on the dependent variable.
