### Hypothesis Test for Linear Regression 

<b>Objective:</b><br>
A test to determine the significance of the regression slope.<br>
Here we use a two-tailed test to determine whether the slope is significant; the same rules apply to an upper- or lower-tailed test. The population least squares regression line is $y = \beta_0 + \beta_1 x + \epsilon$<br>Where:<br><br>

* $\beta_0$ (pronounced beta naught) is the population y-intercept
* $\beta_1$ (pronounced beta one) is the population slope 
* $\epsilon$ is the error term
<Br><br>

#### The Data and Problem Statement
We will only be using the two-tailed test for a population slope.<br><br>
The hypotheses concern the relationship between x and y. <br><br>

* $H_0: \beta_1 = 0$
* $H_a: \beta_1 \ne 0$
<br>

If the regression equation has a slope of zero, then every x value gives the same predicted y value and the regression equation is useless for prediction. We should therefore perform a t-test to check that the slope is significantly different from zero before using the regression equation for prediction. The numeric value of t is the same as for the t-test of a correlation: the two test statistics are algebraically equal, although the formulas look different and the hypotheses are stated about a different parameter.<br><br>
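The algebraic equality of the two test statistics can be sketched as follows. This is a short derivation for the single-predictor case only, written with the $SS_{xx}$, $SS_{yy}$, $SS_{xy}$ notation that this notebook defines in its ANOVA section. The symbols $b_1$ (sample slope) and $r$ (sample correlation coefficient) are introduced here for illustration, and $MSE = SSE/(n-2)$ is assumed.

```latex
\[
\begin{aligned}
b_1 &= \frac{SS_{xy}}{SS_{xx}}, \qquad
r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}, \qquad
SSE = SS_{yy} - \frac{(SS_{xy})^2}{SS_{xx}} = SS_{yy}\,(1 - r^2) \\[4pt]
t &= \frac{b_1}{\sqrt{MSE / SS_{xx}}}
   = \frac{SS_{xy}}{SS_{xx}} \cdot \sqrt{\frac{SS_{xx}\,(n-2)}{SS_{yy}\,(1-r^2)}}
   = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} \cdot \sqrt{\frac{n-2}{1-r^2}}
   = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
\end{aligned}
\]
```

The last expression is the usual test statistic for a correlation, so both tests lead to the same t value and the same decision.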

The data relate hours studied (the independent variable) to exam scores (the dependent variable). <br><Br>
If the regression line were horizontal $(\beta_1 = 0)$, it would give the same y-value for every input $x$ and would be of no use for prediction. 
<br>
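As a rough cross-check on the cell that follows, the least-squares estimates can be computed with the standard library alone. This is a minimal sketch, not the notebook's own implementation: `fit_line` is a hypothetical helper, and it is only an assumption that `data.get_slope` and `data.get_y_intercept` compute the same quantities.

```python
# Data used throughout this notebook: hours studied (x) and exam scores (y).
hrs_studied = [20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14]
exams = [89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76]


def fit_line(x, y):
    """Return (b0, b1), the least-squares estimates of beta_0 and beta_1.

    Hypothetical helper for illustration only.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squared deviations and cross-products about the means.
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx          # sample slope
    b0 = y_bar - b1 * x_bar     # sample y-intercept
    return b0, b1


b0, b1 = fit_line(hrs_studied, exams)
print(f"y_hat = {b0:.4f} + {b1:.4f} x")
```

A slope estimate well away from zero is what the hypothesis test then has to confirm as statistically significant.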

In [14]:

import numpy as np
import pandas as pd 
import sys
sys.path.insert(0, '../..')
from resources import datum, glyph
from IPython.display import display, Math, Markdown

hrs_studied = [20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14]
exams = [89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76]

data = datum.Data()
plot = glyph.Glyph(
    title = 'Hours Studied/Exam Score Data\nbeta_1 = 0'
    , x_axis_label='Hrs Studied'
    , y_axis_label='Exam Scores'
    , y_range = (30, 100)
    )

# sanity check with my code
beta1 = data.get_slope(x = hrs_studied, y = exams)
beta0 = data.get_y_intercept(x = hrs_studied, y = exams, m = beta1)

plot.make_points(shape = 'circle', x = hrs_studied, y = exams, size = 8, line_color='crimson', fill_color='crimson')

plot.add_regression_line(slope = 0, y_intercept=beta0,line_color = 'gainsboro')

msg = '\\displaystyle \\color{gainsboro} y = \\beta_0 + \\beta_1 x + \\epsilon\\\\~\\\\'
msg = msg + 'y = %s + %s x + \\epsilon'

plot.show()

display(Math(msg%(
    f'{beta0: .4f}', 0
)))


For there to be a statistically significant linear relationship, the slope must differ significantly from zero. 

In [17]:

import numpy as np
import pandas as pd 
import sys
sys.path.insert(0, '../..')
from resources import datum, glyph, dsutils 
from IPython.display import display, Math, Markdown


hrs_studied = [20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14]
exams = [89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76]

ds = dsutils.Utils(x = hrs_studied, y = exams)
ds.chart_regression(chart=True)

data = datum.Data()

tail = 'two'
CL = .95
alpha = 1 - CL 
n = len(hrs_studied)
p = 1 # the number of predictors, (independent variables), for now set to 1
df = n - p - 1
xy_sum, mu_x_sqr_sum, mu_y_sqr_sum, cov, mu_cov  = data.chart_regression(x = hrs_studied, y = exams)

SSR = cov**2/mu_x_sqr_sum
SST = mu_y_sqr_sum
SSE = SST - SSR
MSR = SSR/p 
MSE = SSE/df


data = datum.Data()
plot = glyph.Glyph(
    title = 'Hours Studied/Exam Score Data\n'
    , x_axis_label='Hrs Studied'
    , y_axis_label='Exam Scores'
    )

# calculate critical value 
critical_value = data.get_t_critical_value(tail = tail, q = alpha, df = df)

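# sanity check with my code
beta1 = data.get_slope(x = hrs_studied, y = exams)
beta0 = data.get_y_intercept(x = hrs_studied, y = exams, m = beta1)

# test statistic for the slope: t = beta1 / sqrt(MSE / SS_xx)
test_stat = beta1/(np.sqrt(MSE/mu_x_sqr_sum))

# calculate the p-value from the test statistic (not from the critical value);
# tail = 'upper' presumably returns a one-tail area, which a two-tailed test would double
p_value = data.get_pvalue(test_statistic=test_stat, tail = 'upper')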

plot.make_points(shape = 'circle', x = hrs_studied, y = exams, size = 8, line_color='crimson', fill_color='crimson')

plot.add_regression_line(slope = beta1, y_intercept=beta0,line_color = 'gainsboro')

msg = '\\displaystyle \\color{gainsboro} \\star~H_0:\\quad \\beta_1 = 0\\\\~\\\\'
msg = msg + '\\star~H_a: \\quad \\beta_1 \\ne 0\\\\~\\\\'
msg = msg + 'y = \\beta_0 + \\beta_1 x + \\epsilon\\\\~\\\\'
msg = msg + 'y = %s + %s x + \\epsilon\\\\~\\\\\\star~n: %s \\\\~\\\\'
msg = msg + '\\star~p: %s \\text{ - the number of predictors, (independent variables)}\\\\~\\\\'
msg = msg + '\\star~df = n - p - 1 = %s\\\\~\\\\'
msg = msg + '\\star~\\text{critical value: } %s\\\\~\\\\'
msg = msg + '\\star~\\small SS_{xx}\\quad(\\mu_x^2) = %s\\\\~\\\\\\star~\\small SS_{yy}\\quad(\\mu_y^2) = %s \\normalsize \\\\~\\\\'
msg = msg + '\\star~\\small SS_{xy}\\quad(\\mu_x \\cdot \\mu_y) = %s\\\\~\\\\'
msg = msg + '\\star~SSR = \\dfrac{(\\mu_x \\cdot \\mu_y)^2}{\\mu_x^2} = \\dfrac{%s}{%s} = %s\\\\~\\\\'
msg = msg + '\\star~SST = \\mu_y^2 = %s\\\\~\\\\'
msg = msg + '\\star~SSE = \\text{ SST - SSR } = %s - %s = %s\\\\~\\\\'
msg = msg + '\\star~MSR = \\dfrac{SSR}{p} = \\dfrac{%s}{%s} = %s\\\\~\\\\'
msg = msg + '\\star~MSE = \\dfrac{SSE}{DF_e} = \\dfrac{%s}{%s} = %s\\\\~\\\\'
msg = msg + '\\star \\normalsize \\text{ test statistic } = \\small \\dfrac{\\beta_1}{\\sqrt{\\dfrac{MSE}{\\mu_x^2}}} = ' 
msg = msg + '\\dfrac{%s}{\\sqrt{\\dfrac{%s}{%s}}} = %s\\\\~\\\\'
msg = msg + '\\text{The test statistic 5.271 is greater than the critical value of 2.160 and falls in the rejection region.}\\\\~\\\\'
msg = msg + '\\text{The decision is to reject }H_0: \\quad \\beta_1 = 0\\\\~\\\\'
msg = msg + '\\text{The slope is significantly different from zero, so the regression equation is suitable for prediction.}\\\\~\\\\'


plot.show()

display(Math(msg%(
    f'{beta0: .4f}', f'{beta1: .4f}' 
    , n, p, df
    , f'{critical_value: .4f}'
    , mu_x_sqr_sum, f'{mu_y_sqr_sum: .4f}', f'{cov: .4f}'
    , f'{(cov)**2: .4f}',  mu_x_sqr_sum, f'{SSR: .4f}'
    , f'{mu_y_sqr_sum: .4f}'
    , f'{SST: .4f}', f'{SSR: .4f}', f'{SSE: .4f}'
    , f'{SSR: .4f}', p, f'{MSR: .4f}'
    , f'{SSE: .4f}', df, f'{MSE: .4f}'
    , f'{beta1: .4f}', f'{MSE: .4f}', mu_x_sqr_sum, f'{test_stat: .4f}'
)))

mu = 0
s = 1
norm_x = list(np.linspace(-6, 6, 100))
norm_y = data.get_normal_dist(x_arr=norm_x, mu = mu, sigma = s)

norm_dist = glyph.Glyph(
    title = ''
    , x_axis_label='x'
    , y_axis_label='y'
    ,legend_location='top_left'
    )

norm_dist.make_line(norm_x, norm_y, color = 'gainsboro', alpha = .5)

# get arrays for lower and upper regions of rejection
critical_value_auc = data.get_t_auc(critical_values = [critical_value], df = df, mu = mu)

lower_rej_reg_x, upper_rej_reg_x = data.get_rejection_regions(x = norm_x, tail = tail, alpha = critical_value_auc[0], observed_value = mu, sigma = s )

# plot lower regions of rejections 
lower_rej_reg_y = data.get_normal_dist(x_arr = lower_rej_reg_x, mu = mu, sigma = s)
lower_rej_reg_floor = np.zeros(len(lower_rej_reg_y))
norm_dist.make_varea(x=lower_rej_reg_x
                     , y=lower_rej_reg_y
                     , floor = lower_rej_reg_floor
                     , fill_color = 'grey'
                     , fill_alpha=0.80
                     , legend_label=f"critical values: [- {critical_value: .4f}, {critical_value: .4f}]"
                    )

# plot upper regions of rejections 
upper_rej_reg_y = data.get_normal_dist(x_arr = upper_rej_reg_x, mu = mu, sigma = s)
upper_rej_reg_floor = np.zeros(len(upper_rej_reg_y))
norm_dist.make_varea(x=upper_rej_reg_x
                     , y=upper_rej_reg_y
                     , floor = upper_rej_reg_floor
                     , fill_color = 'grey'
                     , fill_alpha=0.80)

# plot test statistic 
norm_dist.make_points(
    shape = 'star'
    , x = [test_stat]
    , y = data.get_normal_dist([test_stat], mu = mu, sigma = s)
    , fill_color='crimson'
    , line_color='crimson'
    , size = 10
    , label =  f'test statistic: {test_stat: .4f}'
    )


norm_dist.show()

<br><b>Regression Chart</b><br><br>

|  point #  |  $x$  |   $y$   |  $xy$   |  $\mu_x^2$  |  $\mu_y^2$  |  $\mu_x \cdot \mu_y$  |
|:---------:|:-----:|:-------:|:-------:|:-----------:|:-----------:|:---------------------:|
|     1     |  20   |   89    |  1780   |    11.56    |   78.6178   |        30.1467        |
|     2     |  16   |   72    |  1152   |    0.36     |   66.1511   |         4.88          |
|     3     |  20   |   93    |  1860   |    11.56    |   165.551   |        43.7467        |
|     4     |  18   |   84    |  1512   |    1.96     |   14.9511   |        5.41333        |
|     5     |  17   |   81    |  1377   |    0.16     |  0.751111   |       0.346667        |
|     6     |  16   |   75    |  1200   |    0.36     |   26.3511   |         3.08          |
|     7     |  15   |   70    |  1050   |    2.56     |   102.684   |        16.2133        |
|     8     |  17   |   82    |  1394   |    0.16     |   3.48444   |       0.746667        |
|     9     |  15   |   69    |  1035   |    2.56     |   123.951   |        17.8133        |
|    10     |  16   |   83    |  1328   |    0.36     |   8.21778   |         -1.72         |
|    11     |  15   |   80    |  1200   |    2.56     |  0.0177778  |       0.213333        |
|    12     |  17   |   83    |  1411   |    0.16     |   8.21778   |        1.14667        |
|    13     |  16   |   81    |  1296   |    0.36     |  0.751111   |         -0.52         |
|    14     |  17   |   84    |  1428   |    0.16     |   14.9511   |        1.54667        |
|    15     |  14   |   76    |  1064   |    6.76     |   17.0844   |        10.7467        |
| $\Sigma$  |  249  |  1202   |  20087  |    41.6     |   631.733   |         133.8         |
|   $\mu$   | 16.6  | 80.1333 | 1339.13 |   2.77333   |   42.1156   |         8.92          |
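The $\Sigma$ row of the chart can be reproduced with a few lines of plain Python. This is a sketch under the assumption that the $\mu_x^2$, $\mu_y^2$ and $\mu_x \cdot \mu_y$ columns hold squared deviations and cross-products about the means, which is what the tabulated values are consistent with. The function name `regression_chart_totals` is hypothetical and is not part of the notebook's `resources` package.

```python
hrs_studied = [20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14]
exams = [89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76]


def regression_chart_totals(x, y):
    """Return the column totals of the regression chart as a dict.

    Hypothetical helper for illustration only.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return {
        "sum_x": sum(x),
        "sum_y": sum(y),
        "sum_xy": sum(xi * yi for xi, yi in zip(x, y)),
        # Totals of the squared-deviation and cross-product columns.
        "ss_xx": sum((xi - x_bar) ** 2 for xi in x),
        "ss_yy": sum((yi - y_bar) ** 2 for yi in y),
        "ss_xy": sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)),
    }


totals = regression_chart_totals(hrs_studied, exams)
for name, value in totals.items():
    print(f"{name:>6}: {value:.4f}")
```

These three `ss_` totals are the only inputs the rest of the test needs besides the sample size.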


### ANOVA Table 

* $SS_{xx} = \sum (x - \bar{x})^2$, the total of the $\mu_x^2$ column in the regression chart

* $SS_{yy} = \sum (y - \bar{y})^2$, the total of the $\mu_y^2$ column

* $SS_{xy} = \sum (x - \bar{x})(y - \bar{y})$, the total of the $\mu_x \cdot \mu_y$ column

* $\displaystyle \small SSR - \text{The sum of squares for regression } = \dfrac{(SS_{xy})^2}{SS_{xx}}$

* $\displaystyle \small SST - \text{The total sum of squares } = SS_{yy}$
  
* $\displaystyle \small SSE - \text{The sum of squares for error } = SST - SSR$

* $\displaystyle \small DFR - \text{The degrees of freedom for regression}$
  * $\displaystyle \small DFR = p$, the number of predictors

* $\displaystyle \small DFE - \text{The degrees of freedom for error}$
  * $\displaystyle \small DFE = n - p - 1$

* $\displaystyle \small MSR - \text{The mean square for regression}$

  * $\displaystyle \small MSR = \dfrac{SSR}{DFR}$
  
* $\displaystyle \small MSE - \text{The mean square for error}$
  
  * $\displaystyle \small MSE = \dfrac{SSE}{DFE}$
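The quantities listed above can be chained into the slope test in a few lines. This is a minimal standard-library sketch for the single-predictor case ($p = 1$) and is not the notebook's `datum` implementation. The function name `slope_t_test` is hypothetical, and the critical value 2.160 is simply the figure quoted earlier in this notebook for $df = 13$ at the 95% confidence level, not something recomputed here.

```python
import math

hrs_studied = [20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14]
exams = [89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76]


def slope_t_test(x, y):
    """Return the ANOVA quantities and t statistic for one predictor.

    Hypothetical helper for illustration only.
    """
    n = len(x)
    p = 1                                   # number of predictors
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = ss_xy / ss_xx                      # sample slope
    ssr = ss_xy ** 2 / ss_xx                # sum of squares for regression
    sst = ss_yy                             # total sum of squares
    sse = sst - ssr                         # sum of squares for error
    dfe = n - p - 1                         # degrees of freedom for error
    msr = ssr / p
    mse = sse / dfe
    t = b1 / math.sqrt(mse / ss_xx)         # test statistic for H0: beta_1 = 0
    return {"b1": b1, "SSR": ssr, "SST": sst, "SSE": sse,
            "DFE": dfe, "MSR": msr, "MSE": mse, "t": t}


result = slope_t_test(hrs_studied, exams)

# Two-tailed critical value quoted in this notebook (df = 13, CL = .95).
T_CRITICAL = 2.160
reject_h0 = abs(result["t"]) > T_CRITICAL

for name, value in result.items():
    print(f"{name:>4}: {value:.4f}")
print("reject H0:", reject_h0)
```

Comparing the absolute value of `t` against the critical value mirrors the two-tailed decision rule used in the notebook.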
  

  

  


