# Sklearn multiple linear regression Rand 1,2,3

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

In [8]:
data = pd.read_csv('Multiple Linear Regression.csv')
data.head() #display top 5 rows

Unnamed: 0,SAT,GPA,"Rand 1,2,3"
0,1714,2.4,1
1,1664,2.52,3
2,1760,2.54,3
3,1685,2.74,3
4,1693,2.83,2


In [9]:
data.describe()

Unnamed: 0,SAT,GPA,"Rand 1,2,3"
count,84.0,84.0,84.0
mean,1845.27381,3.330238,2.059524
std,104.530661,0.271617,0.855192
min,1634.0,2.4,1.0
25%,1772.0,3.19,1.0
50%,1846.0,3.38,2.0
75%,1934.0,3.5025,3.0
max,2050.0,3.81,3.0


    Rand 1,2,3 is a variable that is randomly assigned to each sample.
    
     A sample is the machine learning term for an observation. The sample size is 84.

## Create the multiple linear regression

In [10]:
x = data[['SAT', "Rand 1,2,3"]]
y = data['GPA']

In [11]:
reg = LinearRegression()
reg.fit(x,y)
# no need to reshape the inputs: x is already 2D (n_samples, n_features), which is what sklearn expects

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [12]:
reg.coef_ # coefficients of SAT and Rand 1,2,3 (in the order the columns were fed in)

array([ 0.00165354, -0.00826982])

In [13]:
reg.intercept_

0.29603261264909353
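
To make the fitted model concrete, the intercept and coefficients can be plugged into the regression equation by hand. The sketch below hard-codes the values printed above (coefficients rounded as displayed); the student with SAT = 1700 and Rand 1,2,3 = 2 is a made-up example, not a row from the dataset.

```python
import numpy as np

# Values copied from the outputs above (coefficients rounded as displayed)
intercept = 0.29603261264909353
coefs = np.array([0.00165354, -0.00826982])  # order: SAT, Rand 1,2,3

# Hypothetical student, not taken from the dataset
new_student = np.array([1700, 2])  # SAT = 1700, Rand 1,2,3 = 2

# y_hat = intercept + b1 * SAT + b2 * Rand
predicted_gpa = intercept + coefs @ new_student
print(round(predicted_gpa, 2))  # about 3.09
```

With the fitted model at hand, reg.predict on the same two values should give the same number up to rounding.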

### Calculating R-squared

**r-squared measures goodness of fit.** It is a universal measure for evaluating how well a linear regression fares and for comparing models.

In [14]:
reg.score(x,y)

0.40668119528142815

     Depending on the type of regression being performed, the score has a different meaning. In this case it returns the r-squared of a linear regression (for both simple and multiple linear regressions).
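
For a linear regression, the score is the coefficient of determination, r-squared = 1 - SS_res / SS_tot. A minimal NumPy sketch of that formula follows; the function name r_squared and the toy numbers are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1 - ss_res / ss_tot

# Toy targets (made up for illustration)
y_true = [3.0, 3.2, 3.4, 3.6]

perfect = r_squared(y_true, y_true)      # predictions equal to the targets
baseline = r_squared(y_true, [3.3] * 4)  # always predicting the mean
print(perfect, round(baseline, 6))
```

Perfect predictions give a score of 1, while always predicting the mean gives a score of roughly 0.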

### Adjusted R-squared

**adjusted r-squared is better for multiple lin. reg.** It builds on the r-squared and adjusts for the number of variables included in the model. If we use features with little/no explanatory power, adj. r-squared penalizes them (whereas plain r-squared never decreases when a feature is added).
    
    However, there is no method for this in sklearn.

#### Formula for Adjusted R-Squared

$R^2_{adj.} = 1-(1-R^2)*\frac{n-1}{n-p-1}$

    n = sample size (84)
    p = # of predictors/features (2)
    as shown below:

In [15]:
x.shape

(84, 2)

In [16]:
r2 = reg.score(x,y)

n = x.shape[0]
p = x.shape[1]
adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - p - 1))
adj_r2

0.39203134825134
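
Since sklearn has no method for this, the formula can be wrapped in a small helper for reuse. This is only a sketch: the name adjusted_r2 and the 10-feature scenario are hypothetical, and the r-squared value is copied from the output above.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with n samples and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for the adjustment to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = 0.40668119528142815  # reg.score(x, y) from the notebook

two_features = adjusted_r2(r2, n=84, p=2)   # the model fitted above
ten_features = adjusted_r2(r2, n=84, p=10)  # hypothetical: same fit, 10 features
print(two_features, ten_features)
```

Holding r-squared fixed, the penalty grows with the number of predictors, which is why the hypothetical 10-feature model scores lower.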

**adj. r^2 is lower than r^2 (0.392 < 0.407), which suggests that one or more predictors have little or no explanatory power.** How do we determine which feature is useless? See the next section.

### Feature Selection

**Feature selection simplifies models, improves speed, and prevents a series of unwanted issues that arise from having too many features.**

If a feature's p-value is > 0.05, we disregard the feature.

sklearn has no method for calculating p-values (it is an ML package, not a stats package). The closest equivalent is

feature_selection.f_regression

f_regression fits a simple linear regression of the dependent variable on each feature, one at a time:

1. GPA <- SAT
2. GPA <- Rand 1,2,3

It then calculates the F-statistic for each of these regressions and returns the corresponding p-values.

NOTE: for a simple lin. reg. the p-value of the F-stat = the p-value of the only independent variable
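
To see where a univariate F-statistic comes from, it can be computed by hand. My understanding is that f_regression (with its default centering) derives it from the correlation r between feature and target as F = r^2 / (1 - r^2) * (n - 2). The sketch below checks that this matches the F-statistic of an explicitly fitted simple regression. The data is synthetic and the function name univariate_f_stat is an assumption; turning F into a p-value needs the F distribution (e.g. from scipy) and is left out here.

```python
import numpy as np

def univariate_f_stat(feature, target):
    """F-statistic of the simple regression target ~ feature, via the correlation."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target, dtype=float)
    n = feature.size
    r = np.corrcoef(feature, target)[0, 1]
    return r**2 / (1 - r**2) * (n - 2)

# Synthetic stand-in data (NOT the notebook's CSV)
rng = np.random.default_rng(0)
sat = rng.normal(1850, 100, size=84)
gpa = 0.3 + 0.0016 * sat + rng.normal(0, 0.2, size=84)

f_from_corr = univariate_f_stat(sat, gpa)

# Cross-check: fit the line explicitly and form F from the sums of squares
slope, intercept = np.polyfit(sat, gpa, 1)
fitted = intercept + slope * sat
ss_reg = np.sum((fitted - gpa.mean()) ** 2)  # explained variation (1 degree of freedom)
ss_res = np.sum((gpa - fitted) ** 2)         # residual variation (n - 2 degrees of freedom)
f_from_fit = ss_reg / (ss_res / (sat.size - 2))

print(f_from_corr, f_from_fit)  # the two should agree up to floating-point error
```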

In [17]:
from sklearn.feature_selection import f_regression

In [18]:
f_regression(x,y)

(array([56.04804786,  0.17558437]), array([7.19951844e-11, 6.76291372e-01]))

**First array is F-statistics, second array is p-values**

In [19]:
p_values = f_regression(x,y)[1]
p_values

array([7.19951844e-11, 6.76291372e-01])

In [20]:
p_values.round(3)

array([0.   , 0.676])

SAT p-value = 0.000; Rand 1,2,3 p-value = 0.676 > 0.05, therefore Rand 1,2,3 is useless
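
The 0.05 rule can also be applied programmatically. A minimal sketch, assuming the p-values printed above and a hypothetical cutoff variable alpha:

```python
import numpy as np

features = np.array(['SAT', 'Rand 1,2,3'])
p_values = np.array([7.19951844e-11, 6.76291372e-01])  # copied from f_regression above

alpha = 0.05              # significance threshold used in these notes
keep = p_values <= alpha  # True where the feature passes the threshold

selected = features[keep].tolist()
dropped = features[~keep].tolist()
print(selected, dropped)  # ['SAT'] ['Rand 1,2,3']
```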

NOTE: these are univariate p-values obtained from simple linear models. They don't reflect the interconnection of the features in our multiple linear regression.

## Creating a Summary Table

In [21]:
reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features']) #data=['SAT','Rand 1,2,3']
reg_summary

Unnamed: 0,Features
0,SAT
1,"Rand 1,2,3"


In [22]:
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary

Unnamed: 0,Features,Coefficients,p-values
0,SAT,0.001654,0.0
1,"Rand 1,2,3",-0.00827,0.676


**p-values are one of the best ways to determine whether a variable is redundant,** but they don't provide info about how useful a variable is.
E.g., two variables that both have a p-value of 0.000 are not necessarily equally important.