# Adjusted R-squared - Exercise

Using the code from the lecture, create a function which will calculate the adjusted R-squared for you, given the independent variable(s) (x) and the dependent variable (y).

Check if your function is working properly.

Please solve the exercise at the bottom of the notebook (in order to check if it is working you must run all previous cells).

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load the data

In [2]:
data = pd.read_csv('1.02. Multiple linear regression.csv')
data.head()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
0,1714,1,2.4
1,1664,3,2.52
2,1760,3,2.54
3,1685,3,2.74
4,1693,2,2.83


In [3]:
data.describe()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
count,84.0,84.0,84.0
mean,1845.27381,2.059524,3.330238
std,104.530661,0.855192,0.271617
min,1634.0,1.0,2.4
25%,1772.0,1.0,3.19
50%,1846.0,2.0,3.38
75%,1934.0,3.0,3.5025
max,2050.0,3.0,3.81


## Create the multiple linear regression

### Declare the dependent and independent variables

In [4]:
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']

### Regression itself

In [5]:
reg = LinearRegression()
reg.fit(x,y)

LinearRegression()

In [6]:
reg.coef_

array([ 0.00165354, -0.00826982])

In [7]:
reg.intercept_

0.29603261264909486

### Calculating the R-squared

In [8]:
reg.score(x,y)

0.40668119528142843

### Formula for Adjusted R^2: a goodness-of-fit measure that is more appropriate than R^2 for multiple regression analysis

$R^2_{adj.} = 1 - (1-R^2)\times\frac{n-1}{n-p-1}$

In [9]:
x.shape

(84, 2)

In [10]:
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]

adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2

0.39203134825134023

#My conclusion: the adjusted R^2 (0.392) is lower than the R^2 (0.407). Therefore, one or more of the predictors have little or no explanatory power
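The exercise asks for a reusable function, so here is a minimal sketch of one. The names `adjusted_r2` and `adjusted_r2_from_model` are illustrative choices, not part of sklearn; the wrapper assumes `x` is two-dimensional (so that `x.shape` gives `(n, p)`) and that `model` is any fitted estimator exposing `.score(x, y)`.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared from R-squared, n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more than p + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def adjusted_r2_from_model(model, x, y):
    """Convenience wrapper for a fitted estimator with a .score(x, y) method."""
    n, p = x.shape  # assumes x is 2-D: rows are observations, columns are predictors
    return adjusted_r2(model.score(x, y), n, p)


# Values taken from the cells above: R^2 = 0.4067, n = 84, p = 2
print(adjusted_r2(0.40668119528142843, 84, 2))  # about 0.392, as in the cell above
```

With the regression fitted earlier, `adjusted_r2_from_model(reg, x, y)` should give the same number as the manual calculation.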

In [11]:
#We can confirm this is correct by checking the StatsModels summary for the same regression

### Feature selection module from sklearn: f_regression

f_regression creates a simple linear regression of the dependent variable on each feature separately. It is a quick way to screen features and helps in removing variables that are not relevant. Recall that if a variable has a p-value above 0.05, we can discard it. For each of the variables it calculates ...

In [12]:
from sklearn.feature_selection import f_regression

In [13]:
f_regression(x,y)

(array([56.04804786,  0.17558437]), array([7.19951844e-11, 6.76291372e-01]))

#The first array contains the F-statistics, while the second array contains the p-values
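As a sanity check on these F-statistics, the R^2 of each one-feature regression can be recovered by hand. This sketch assumes the usual relationship for a single predictor with an intercept, F = r^2 / (1 - r^2) * (n - 2); the helper name `r2_from_f` is hypothetical.

```python
def r2_from_f(f_stat, n):
    """R^2 of a one-feature regression, recovered from its F-statistic.

    Assumes F = r^2 / (1 - r^2) * (n - 2): one predictor plus an intercept.
    """
    return f_stat / (f_stat + (n - 2))


n = 84  # number of observations
f_stats = {"SAT": 56.04804786, "Rand 1,2,3": 0.17558437}  # from the output above

for name, f_stat in f_stats.items():
    print(name, round(r2_from_f(f_stat, n), 4))
```

Under that assumption, SAT alone explains roughly 41% of the variance in GPA, which is nearly all of the 0.4067 obtained with both features, while Rand 1,2,3 alone explains well under 1%.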

In [14]:
#We are only interested in p_values:
p_values = f_regression(x,y)[1]

In [15]:
p_values

array([7.19951844e-11, 6.76291372e-01])

In [16]:
p_values.round(3)

array([0.   , 0.676])

In [17]:
#The p-value of SAT is 0.000 while that of Rand 1,2,3 is 0.676, hence we discard the Rand 1,2,3 variable.

In [18]:
#CREATING A SUMMARY TABLE

In [19]:
reg_summary = pd.DataFrame(data = ['SAT', 'Rand 1,2,3'],columns= ['Features'])

In [20]:
reg_summary

Unnamed: 0,Features
0,SAT
1,"Rand 1,2,3"


In [21]:
reg_summary = pd.DataFrame(data = x.columns.values, columns =['Features'])

In [22]:
reg_summary

Unnamed: 0,Features
0,SAT
1,"Rand 1,2,3"


In [23]:
#The second approach is preferable for models with a lot of features

In [24]:
reg_summary['Coefficients'] = reg.coef_
reg_summary['p_values'] = p_values.round(3)

In [25]:
reg_summary

Unnamed: 0,Features,Coefficients,p_values
0,SAT,0.001654,0.0
1,"Rand 1,2,3",-0.00827,0.676


From the above, Rand 1,2,3 is not contributing to our model, hence we remove it.

Final remark: p-values are one of the best ways to determine whether a variable is useful or redundant, but they provide no information about how useful the variables are.


STANDARDIZATION OR FEATURE SCALING: the process of transforming the data onto a standard scale: (x - mean) / standard deviation. It makes features with different magnitudes easier to compare, because it puts all of them on the same scale (mean 0, standard deviation 1).

In [26]:
from sklearn.preprocessing import StandardScaler

In [27]:
scaler = StandardScaler()

In [28]:
scaler.fit(x)

StandardScaler()

In [29]:
#We have to Transform the data

In [30]:
x_scaled = scaler.transform(x)

In [31]:
x_scaled

array([[-1.26338288, -1.24637147],
       [-1.74458431,  1.10632974],
       [-0.82067757,  1.10632974],
       [-1.54247971,  1.10632974],
       [-1.46548748, -0.07002087],
       [-1.68684014, -1.24637147],
       [-0.78218146, -0.07002087],
       [-0.78218146, -1.24637147],
       [-0.51270866, -0.07002087],
       [ 0.04548499,  1.10632974],
       [-1.06127829,  1.10632974],
       [-0.67631715, -0.07002087],
       [-1.06127829, -1.24637147],
       [-1.28263094,  1.10632974],
       [-0.6955652 , -0.07002087],
       [ 0.25721362, -0.07002087],
       [-0.86879772,  1.10632974],
       [-1.64834403, -0.07002087],
       [-0.03150724,  1.10632974],
       [-0.57045283,  1.10632974],
       [-0.81105355,  1.10632974],
       [-1.18639066,  1.10632974],
       [-1.75420834,  1.10632974],
       [-1.52323165, -1.24637147],
       [ 1.23886453, -1.24637147],
       [-0.18549169, -1.24637147],
       [-0.5608288 , -1.24637147],
       [-0.23361183,  1.10632974],
       [ 1.68156984,
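The first row of the scaled array above can be reproduced by hand from the `describe()` output. One detail matters: `describe()` reports the sample standard deviation (ddof=1), while `StandardScaler` divides by the population standard deviation (ddof=0), so the sketch below converts one to the other. The helper names are illustrative only.

```python
import math

n = 84  # number of observations, from describe()


def population_std(sample_std, n):
    """Convert a sample std (ddof=1) to a population std (ddof=0)."""
    return sample_std * math.sqrt((n - 1) / n)


def standardize(value, mean, std):
    """z-score: (value - mean) / std."""
    return (value - mean) / std


sat_std = population_std(104.530661, n)
rand_std = population_std(0.855192, n)

# First row of the data: SAT = 1714, Rand 1,2,3 = 1
z_sat = standardize(1714, 1845.27381, sat_std)
z_rand = standardize(1, 2.059524, rand_std)
print(round(z_sat, 4), round(z_rand, 4))  # close to -1.2634 and -1.2464
```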

#REGRESSION WITH SCALED FEATURES

In [35]:
reg = LinearRegression()

In [36]:
reg.fit(x_scaled,y)

LinearRegression()

In [37]:
reg.coef_

array([ 0.17181389, -0.00703007])

In [38]:
reg.intercept_

3.330238095238095

CREATING A SUMMARY TABLE

In [44]:
reg_summary = pd.DataFrame([['Bias'],['SAT'],['Rand 1,2,3']], columns=['Features'])
reg_summary['Weights'] = reg.intercept_, reg.coef_[0], reg.coef_[1]

In [45]:
reg_summary

Unnamed: 0,Features,Weights
0,Bias,3.330238
1,SAT,0.171814
2,"Rand 1,2,3",-0.00703


#From the above we can clearly see that Rand 1,2,3 barely contributes to the output

This is the reason why sklearn seldom uses p-values to determine the useless or worst-performing features: after standardization, such features usually show up with little or no weight in the summary table.
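The standardized weights are tied to the original coefficients in a simple way: since each scaled feature is (x - mean) / std, each weight equals the raw coefficient multiplied by that feature's (population) standard deviation. A short check using only numbers printed earlier in the notebook:

```python
import math

n = 84
raw_coefs = {"SAT": 0.00165354, "Rand 1,2,3": -0.00826982}  # unscaled regression
sample_stds = {"SAT": 104.530661, "Rand 1,2,3": 0.855192}   # from data.describe()

weights = {}
for name, coef in raw_coefs.items():
    # describe() reports ddof=1; StandardScaler uses ddof=0
    pop_std = sample_stds[name] * math.sqrt((n - 1) / n)
    weights[name] = coef * pop_std

print(weights)  # roughly 0.1718 for SAT and -0.0070 for Rand 1,2,3
```

This is why the weights are comparable across features while the raw coefficients are not: the raw SAT coefficient looks tiny only because SAT is measured in the thousands.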

MAKING PREDICTIONS WITH STANDARDIZED COEFFICIENTS(WEIGHTS)

In [48]:
new_data = pd.DataFrame(data=[[1700,2],[1800,1]],columns=['SAT','Rand 1,2,3'])
new_data

Unnamed: 0,SAT,"Rand 1,2,3"
0,1700,2
1,1800,1


In [49]:
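#Careful: the model was fitted on standardized inputs, so feeding it raw (unscaled) data gives nonsensical predictions
reg.predict(new_data)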



array([295.39979563, 312.58821497])

In [51]:
new_data_scaled = scaler.transform(new_data)

In [52]:
new_data_scaled

array([[-1.39811928, -0.07002087],
       [-0.43571643, -1.24637147]])

In [53]:
reg.predict(new_data_scaled)

array([3.09051403, 3.26413803])
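Both sets of predictions above can be reproduced by hand, which also shows why the unscaled inputs produced GPAs near 300. The sketch below hard-codes the intercept and weights printed earlier; `predict` is just an illustrative helper, not the sklearn method.

```python
intercept = 3.330238095238095
weights = [0.17181389, -0.00703007]  # SAT, Rand 1,2,3 (standardized model)


def predict(row):
    """Linear prediction: intercept + sum(weight * feature value)."""
    return intercept + sum(w * v for w, v in zip(weights, row))


scaled_rows = [[-1.39811928, -0.07002087], [-0.43571643, -1.24637147]]
raw_rows = [[1700, 2], [1800, 1]]

print([round(predict(r), 4) for r in scaled_rows])  # sensible GPAs, about 3.09 and 3.26
print([round(predict(r), 2) for r in raw_rows])     # about 295.4 and 312.59
```

With raw inputs, the SAT weight of 0.17 is multiplied by 1700 instead of by a z-score near -1.4, which is where the inflated values come from.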

What if we removed the Rand 1,2,3 variable?

In [54]:
#We must create a new regression
#Declare the input
reg_simple = LinearRegression()
x_simple_matrix = x_scaled[:,0].reshape(-1,1)
reg_simple.fit(x_simple_matrix,y)

LinearRegression()

In [56]:
reg_simple.predict(new_data_scaled[:,0].reshape(-1,1))

array([3.08970998, 3.25527879])

#From the above, the two sets of predictions are approximately the same. Even without removing Rand 1,2,3 (the variable with the high p-value), its weight is so close to zero that it barely affects the output.


#####LIKELY RECRUITMENT QUESTION!
UNDERFITTING AND OVERFITTING: Broadly speaking, overfitting means our regression has focused so much on a particular dataset that it has missed the point. The model fits the training data too closely: it captures random noise rather than the underlying relationship, so it performs poorly on new data.

UNDERFITTING: The model has not captured the underlying logic of the data, so it performs poorly even on the training data.

In [59]:
#TO RESOLVE THIS WE SPLIT THE DATA INTO TWO PARTS (Training and Testing)

Train Test Split

Import relevant Libraries

In [60]:
import numpy as np
from sklearn.model_selection import train_test_split

Generate some data we are going to split

In [63]:
a = np.arange(1,101)

In [64]:
a

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

In [102]:
b = np.arange(501, 601)

In [103]:
b

array([501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513,
       514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526,
       527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539,
       540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552,
       553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565,
       566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578,
       579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591,
       592, 593, 594, 595, 596, 597, 598, 599, 600])

In [104]:
#Split the Data

In [105]:
train_test_split(a)

[array([ 95,  37,  10,  27,  87,  99,  15,  19,  75,  79,  32,  81,  80,
         25,   8,  17,   1,  67,  11,  92,  68,  20,  23,  57,  39,  48,
        100,   9,  66,  69,  71,   5,   3,  26,  60,  86,  28,  45,  77,
         73,  44,  14,  74,  24,  41,  29,  38,  72,  93,   4,  36,  35,
         42,  97,  82,  83,  96,  63,  61,  70,  18,  58,  31,  51,  55,
         85,  21,  33,  64,   2,  90,  40,   6,  30,  56]),
 array([34, 54, 53, 76, 62, 98, 43, 78, 50, 52, 13, 12, 22, 16,  7, 84, 89,
        94, 59, 91, 49, 47, 88, 46, 65])]

In [106]:
#Let's store the results in dedicated variables
a_train, a_test = train_test_split(a)

In [107]:
#Explore the results: check the shapes of the split arrays to see the default split
a_train.shape, a_test.shape

((75,), (25,))

In [108]:
a_train

array([ 20,   6,  25,  48,  94,  71,  68,  15,  22,  67,  35,   8,  62,
        99,  23,   3,  84,  90,  56,  76,   1,  80,  52,   7,  65, 100,
        39,  59,  44,  17,  64,   4,  30,  54,  33,  60,  24,  42,  26,
        21,  97,  85,  53,  98,  66,  46,  50,  78,  11,  88,  91,  43,
        63,  87,  28,  18,  19,  37,  34,  82,  83,  70,  89,  10,  41,
        31,  72,  49,  77,  36,  38,  16,  93,  86,  40])

In [109]:
a_test

array([14, 92, 69, 13, 73, 29, 47, 79, 75, 45,  5,  2, 55, 96, 51, 61, 58,
        9, 27, 81, 74, 57, 12, 32, 95])

In [110]:
#From the above, the default test size (25%) is larger than we want, hence we reduce it to 20%.
a_train, a_test = train_test_split(a, test_size= 0.2)

In [111]:
a_train

array([ 46,  51,  63,  69,  78,  24,  98,  61,  53,  49,  79,   2,  82,
        85,  58,  80,  64,  95,  87,   9,  27,  36,  66,  44,  86,  15,
         5,  22,  11,  90,  42,  72,  48,  33,  37, 100,  52,  54,  75,
        93,   1,  99,  62,  10,  84,  76,  70,  45,   6,  57,  17,  83,
         3,  13,  40,  67,  73,   4,  23,  68,  39,  41,   7,  71,  30,
        28,  56,  31,  91,  77,  89,  25,  19,  92,  97,  16,  18,  65,
        74,  43])

In [112]:
a_test

array([14, 29, 96, 47, 20, 81, 32, 50,  8, 21, 35, 34, 94, 12, 60, 26, 38,
       59, 55, 88])

In [113]:
#Now we have 80%- 20%

In [114]:
#Set shuffle=False to keep the data in its original order, i.e. 1-80 for training and 81-100 for testing
a_train, a_test = train_test_split(a, test_size= 0.2, shuffle=False)

In [115]:
a_train

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80])

In [116]:
a_test

array([ 81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,
        94,  95,  96,  97,  98,  99, 100])

In [117]:
#sklearn has a built-in random_state argument which makes the shuffle reproducible: the same value gives the same split on every run
#We can also split more than one array at a time
a_train,a_test,b_train,b_test = train_test_split(a,b, test_size= 0.2,random_state=365)

In [118]:
a_train

array([ 25,  32,  99,  73,  91,  66,   3,  59,  94,   1,   8,  15,  90,
        54,  31,  20,  77,  82,  30,  35,  95,  42,  38,   7,  11,  50,
        21,  48,   2,  17,  10,  58,  68,  43,  41,  16,  88,  72,  79,
       100,  80,  39,  24,  86,  22,  23,  62,  76,  18,  47,  55,  26,
        60,  19,  71,  64,  51,  63,  65,  28,  12,  78,  13,  44,  75,
        87,  40,   4,  29,  49,  37,  57,  27,  74,   6,  45,  92,  34,
        53,  83])

In [119]:
a_test

array([ 9, 69, 81, 56, 33, 93, 84, 61, 46, 89, 85, 67, 97,  5, 70, 36, 98,
       96, 14, 52])

In [120]:
#The split above changes only if we change random_state (365 above) to another number
#Here we re-split a alone with random_state=30, so a_train and a_test no longer line up with b_train and b_test


a_train, a_test = train_test_split(a, test_size= 0.2,random_state=30 )

In [121]:
a_test

array([21, 92, 35, 53,  9, 75, 22, 89, 81, 90, 83, 39,  1, 78, 43, 68, 69,
       93, 49, 11])

In [122]:
a_train

array([ 94,   5,  52,  73,   2,  88,  79,  25,  72,  55,  96,  33,  64,
        70,  30,  31,  20,  61,  60,  44,  27,   6,  26,  84,  23,  37,
        67,  48,  87,  10,  40,  57,  99,  45,  74,  32,  71,  41,  58,
        15,  76,  59,  86,  29,  56,  34,  51,  14,  95,  16,  12,  80,
        65,  85,  91,  28,  63,   7,  17,  77,  82,  19,  36,  62,  97,
        50,  66,   8,  42,   4,  47,  18,  54,   3,  24,  13,  98,  46,
       100,  38])

In [123]:
b_train

array([525, 532, 599, 573, 591, 566, 503, 559, 594, 501, 508, 515, 590,
       554, 531, 520, 577, 582, 530, 535, 595, 542, 538, 507, 511, 550,
       521, 548, 502, 517, 510, 558, 568, 543, 541, 516, 588, 572, 579,
       600, 580, 539, 524, 586, 522, 523, 562, 576, 518, 547, 555, 526,
       560, 519, 571, 564, 551, 563, 565, 528, 512, 578, 513, 544, 575,
       587, 540, 504, 529, 549, 537, 557, 527, 574, 506, 545, 592, 534,
       553, 583])

In [124]:
b_test

array([509, 569, 581, 556, 533, 593, 584, 561, 546, 589, 585, 567, 597,
       505, 570, 536, 598, 596, 514, 552])
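The `b_train` and `b_test` arrays above are exactly the `random_state=365` split of `a` shifted by 500, because both arrays were cut at the same shuffled positions. The idea can be sketched in pure Python. This is a simplified illustration, not sklearn's implementation: `simple_split` is a hypothetical helper, and its shuffle will not reproduce the specific element order sklearn gives for the same seed.

```python
import random


def simple_split(*arrays, test_size=0.25, seed=None):
    """Shuffle the indices once, then cut every array at the same positions."""
    n = len(arrays[0])
    if any(len(arr) != n for arr in arrays):
        raise ValueError("all arrays must have the same length")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the shuffle reproducible
    n_test = round(n * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    result = []
    for arr in arrays:
        result.append([arr[i] for i in train_idx])
        result.append([arr[i] for i in test_idx])
    return result


a = list(range(1, 101))
b = list(range(501, 601))
a_train, a_test, b_train, b_test = simple_split(a, b, test_size=0.2, seed=365)

print(len(a_train), len(a_test))                                   # 80 20
print(all(bv - av == 500 for av, bv in zip(a_train, b_train)))     # True
```

Splitting inputs and targets in a single call is what keeps each observation paired with its own target.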