## Simple Linear Regression ##

### Introduction ###
- Suppose we have two measurements for each individual, and we have $n$ individuals.
- One of the measurements is a response variable $y_i,\ i=1,2,\ldots, n$.
- The other is a single explanatory variable $x_i,\ i=1,2,\ldots, n$.
- We assume a linear relation between them
$$y_i =  \beta_0 + \beta_1 x_i + \epsilon_i$$
    

- $\epsilon_i$ are Normal disturbance terms, e.g., due to measurement error
- $\epsilon_i$ is the only source of randomness that we care about. Since we are interested in $p(Y|X)$, we can let $X$ have an arbitrary distribution or treat it as non-random.
- $\beta_0$ and $\beta_1$ are the key parameters to be estimated.
- Each disturbance $\epsilon_i$ has mean 0 and the same variance $\sigma^2$
 
 <img src="https://fmai-teaching.s3.amazonaws.com/bia652/regression_assumption.jpg" width="500px"></img>

$\epsilon_i$ are independent of each other:

$$ cov(\epsilon_i, \epsilon_j)= 
\begin{cases}
 0, & i \ne j \\ \sigma^2, & i = j
\end{cases}
$$
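
The assumptions above can be made concrete by simulating from the model. The following is a minimal sketch using only the Python standard library; the function name `simulate`, the parameter values (`beta0 = 2.0`, `beta1 = 0.5`, `sigma = 1.0`) and the evenly spaced, non-random `x` are illustrative assumptions, not part of the original notes.

```python
import random


def simulate(n, beta0, beta1, sigma, seed=0):
    """Draw n points from y_i = beta0 + beta1 * x_i + eps_i, eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    # x is treated as fixed (non-random): evenly spaced design points.
    x = [float(i) for i in range(1, n + 1)]
    # Independent Normal disturbances with mean 0 and common variance sigma^2.
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]
    y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]
    return x, y, eps


x, y, eps = simulate(n=200, beta0=2.0, beta1=0.5, sigma=1.0)
mean_eps = sum(eps) / len(eps)
print(f"sample mean of disturbances: {mean_eps:.3f}")
```

The sample mean of the simulated disturbances should be close to 0, in line with the zero-mean assumption.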


###  Questions to ask ###

- How to find the regression line (estimating $\beta_1$ and $\beta_0$)? 
- How can we quantify the uncertainty of the estimation/prediction?
- How well does the regression line fit the data?

### Ordinary Least Squares (OLS) derivation ###
- Here is one way to fit the regression line
- Model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
- We want to minimize the squared error, $f(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0  - \beta_1 x_i)^2$
- This is a quadratic function of the 2 variables $\beta_0$ and $\beta_1$. The minimum is achieved where both partial derivatives are zero.

- $\frac{\partial f}{\partial\beta_1}= 2 \sum_{i=1}^{n}x_i (\beta_0 + \beta_1 x_i - y_i) = 0$
    - $(\sum_{i=1}^{n} x_i^2)\beta_1 + (\sum_{i=1}^{n} x_i)\beta_0 = \sum_{i=1}^{n}x_i y_i$ or $(\sum_{i=1}^{n} x_i^2)\beta_1 + n \bar X \beta_0 = \sum_{i=1}^{n}x_i y_i$
- $\frac{\partial f}{\partial\beta_0}= 2 \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i) = 0$
    - $(\sum_{i=1}^{n} x_i)\beta_1 + n \beta_0 = \sum_{i=1}^{n} y_i$
     or $n\bar X \beta_1 + n \beta_0 = n\bar Y$


- Thus 
$$
\begin{array}{ccc}
\hat\beta_1 &=& \frac{\sum_{i=1}^{n}x_i y_i-n\bar X \bar Y }{ \sum_{i=1}^{n} x_i^2-n {\bar X}^2 }
= \frac{\sum_{i=1}^{n}(x_i-\bar X)(y_i-\bar Y)}{\sum_{i=1}^{n}(x_i-\bar X)^2}\\
\hat\beta_0 &=& \bar Y - \hat\beta_1 \bar X
\end{array}
$$
- Notice that we use $\hat\beta_i$ to denote the estimate of the parameter $\beta_i$
- Let $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ be the predicted value of $y_i$
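
The closed-form estimators above translate directly into code. This is a minimal sketch in plain Python; the function name `ols_fit` and the five data points are made up for illustration. It also evaluates the two partial derivatives (without the factor 2) at the solution to check that the normal equations hold.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form OLS solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sums: S_xy = sum (x_i - x_bar)(y_i - y_bar), S_xx = sum (x_i - x_bar)^2
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0, beta1 = ols_fit(x, y)

# Both normal equations should be satisfied (up to rounding) at the solution.
grad_b0 = sum(beta0 + beta1 * xi - yi for xi, yi in zip(x, y))
grad_b1 = sum(xi * (beta0 + beta1 * xi - yi) for xi, yi in zip(x, y))
print(beta0, beta1, grad_b0, grad_b1)
```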

### MLE derivation


* **Likelihood function**: The probability (density) that your observed data arise from a specific probability distribution defined by a specific set of parameters.

More succinctly, it is the likelihood of the data ($Y$) given the specific predictor variables ($X$) and a mapping function ($f()$), including the parameters that describe the distribution of the data. 

That last part of the description is the important one, and it is why we assume that $\epsilon_i$ is normally distributed. Recall that the probability density function of a normal distribution is

$$ f(x | \mu, \sigma) = \frac{1} {{\sigma \sqrt {2\pi } }} e^{{\frac{ - ( {x - \mu })^2 }{2\sigma^2} }} $$

Now if we assume that $\epsilon_i$ are normally distributed, $X$ is non-random, and $\beta_0$ and $\beta_1$ are parameters (fixed numbers), it follows that each $y_i$ is normally distributed (what is its mean and standard deviation?). Because the $\epsilon_i$ are independent, the likelihood is the product ($\prod$) of the PDFs of the individual $y_i$.


$$ \prod_{i=1}^{n} p(y_i | x_i; \beta_0, \beta_1, \sigma) =  \prod_{i=1}^{n} \frac{1} {{\sigma \sqrt {2\pi } }} e^{{\frac{ - ( {y_i - (\beta_0 + \beta_1x_i) })^2 }{2\sigma^2} }} $$

In plain English, the likelihood is the joint probability density of the observed $y$ values, given the parameters we want to estimate. We want to _maximize_ this function, i.e., find the values of $\beta_0, \beta_1,$ and $\sigma$ under which the observed data are most probable.

In practice it is easier to work with the log of this function, called the _log likelihood function_ ($logL$), which reduces the problem to simpler algebra.

$$
\begin{array}{rcl}
logL(\beta_0, \beta_1, \sigma) &=& \log \prod_{i=1}^{n} p(y_i | x_i; \beta_0, \beta_1, \sigma) \\
&=& \sum_{i=1}^{n} \log  p(y_i | x_i; \beta_0, \beta_1, \sigma) \\
&=& \frac{-n}{2} \log(2\pi) - n \log(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2
\end{array}
$$

For any fixed $\sigma$, maximizing $logL$ over the $\beta$s means minimizing $\sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2$, which is exactly the OLS objective function. 

<br>
<center> <b> OLS estimators = MLE estimators! </b> </center>
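
The equivalence can also be checked numerically. The sketch below is a sanity check rather than a proof: it evaluates the log likelihood from the formula above at the OLS estimates and at a grid of perturbed coefficients, for a fixed $\sigma$. The function name `log_likelihood`, the data, and the perturbation grid are illustrative assumptions.

```python
import math


def log_likelihood(beta0, beta1, sigma, x, y):
    """Gaussian log likelihood of the simple linear regression model."""
    n = len(x)
    ss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    return (-n / 2) * math.log(2 * math.pi) - n * math.log(sigma) - ss / (2 * sigma ** 2)


# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1_ols = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0_ols = y_bar - b1_ols * x_bar

sigma = 1.0  # any fixed positive value works for this comparison
ll_ols = log_likelihood(b0_ols, b1_ols, sigma, x, y)
# Perturb the coefficients on a small grid; none of these should beat the OLS point.
perturbed = [
    log_likelihood(b0_ols + d0, b1_ols + d1, sigma, x, y)
    for d0 in (-0.5, -0.1, 0.1, 0.5)
    for d1 in (-0.5, -0.1, 0.1, 0.5)
]
print(ll_ols, max(perturbed))
```

Changing `sigma` shifts all the log likelihood values but does not change which coefficients give the largest one.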


### Estimation of the variance ###

- Can we estimate the variance $\sigma^2$ of $\epsilon_i$?
- The unbiased estimate of $\sigma^2$ is called the **residual mean square (RES. MS)** or **mean square error (MSE)** and is computed as
$$S^2 = \frac{\sum_{i=1}^{n} (y_i-\hat y_i)^2}{n-2}$$
- This measures how far our predictions are from the observed values

    <img src="https://fmai-teaching.s3.amazonaws.com/bia652/se.jpg"></img>
    

- $n-2$ is the residual degrees of freedom: the sample size minus the number of estimated parameters. Dividing by it instead of $n$ gives an **unbiased** estimate of the variance.
- Why minus 2? Think about what happens if $n=2$: a line fits two points perfectly, so the residuals carry no information about $\sigma^2$. When $n=3$, only one point is really free. Hence $n-2$.    
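
A minimal sketch of the estimator, in plain Python. The function name `residual_mean_square` and the data are illustrative assumptions; the coefficients 0.05 and 1.99 are the OLS estimates for these five made-up points. The guard for $n \le 2$ reflects the degrees-of-freedom argument above.

```python
def residual_mean_square(x, y, beta0, beta1):
    """Return S^2 = SS_res / (n - 2) for a fitted line y = beta0 + beta1 * x."""
    n = len(x)
    if n <= 2:
        # With n = 2 the line passes through both points: no residual degrees of freedom.
        raise ValueError("need n > 2 to estimate the variance")
    ss_res = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    return ss_res / (n - 2)


# Made-up data; 0.05 and 1.99 are the OLS estimates for these five points.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
s2 = residual_mean_square(x, y, beta0=0.05, beta1=1.99)
print(s2)
```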

### Hypothesis testing and confidence intervals for $\hat\beta_0$ and $\hat\beta_1$
- Now that we have estimated the parameters, we may wish to test whether, for example, one of them is really zero
- The more general case: **null hypothesis** $H_0: \beta_i = \beta_i^0$
- For example, we want to check whether $\beta_1 = 0$ (i.e., $Y$ is not influenced by $X$ at all)
- As usual, we need the mean and variance/standard deviation of $\hat\beta_0$ and $\hat\beta_1$
- The **standard errors** $=$ uncertainty of using $\hat\beta_0$ and $\hat\beta_1$ to estimate $\beta_0$ and $\beta_1$
    - $SE(\hat\beta_0) = S\left(\frac{1}{n}+\frac{\bar X^2}{\sum_{i=1}^{n}(x_i-\bar X)^2}\right)^{1/2}$
    - $SE(\hat\beta_1) = \frac{S}{\left(\sum_{i=1}^{n}(x_i-\bar X)^2\right)^{1/2}}$

- The test statistic for the null hypothesis on $\beta_i$ is $t_i=\frac{\hat\beta_i - \beta_i^0}{SE(\hat\beta_i)}$, which is compared with a $t$-distribution with $n-2$ degrees of freedom. 
- Think of it as a z-score when the sample size is large!
- The p-value is the probability, given the null hypothesis $H_0$, of observing a test statistic at least as extreme as $t_i$: the area under the probability density function of the $t$-distribution outside $[-|t_i|, |t_i|]$
- If the test statistic ($|t|$) is large, or the p-value is close to 0, the evidence against the null hypothesis is strong; if the test statistic is small and the p-value is large, the evidence is weak. 

- e.g., if $\hat\beta_i = 40$, $\beta_i^0 = 4$, $SE(\hat\beta_i) = 4$ and $n = 10$, $t_i = \frac{40 - 4}{4} = 9$
- $P(|x| \ge 9) = P(x \le -9) + P(x \ge 9) =  CDF(-9)+(1-CDF(9)) = 1.85\times 10^{-5}$, where $CDF$ is that of the $t$-distribution with $10-2=8$ degrees of freedom
- So the null hypothesis is rejected

In [None]:
from scipy import stats

df = 10 - 2  # residual degrees of freedom, n - 2
stats.t.cdf(-9, df) + (1 - stats.t.cdf(9, df))  # two-sided p-value

- Confidence intervals for the parameters are
    
$CI(\beta_i) = \hat\beta_i \pm t(\alpha) SE(\hat\beta_i)$

It means that, over repeated samples, the probability that such an interval fails to cover the true parameter value is $\alpha$

$[-t(\alpha),t(\alpha)]$ is the central interval that contains probability $1-\alpha$ under the $t$-distribution with $n-2$ degrees of freedom.

e.g., if $n = 4$ and we want the 95% confidence interval, then $\alpha=0.05$ and $t(0.05)={CDF}^{-1}(1-0.05/2) = 4.3$


In [None]:
df = 4 - 2  # residual degrees of freedom, n - 2
stats.t.ppf(1 - 0.05 / 2, df)  # two-sided 95% critical value
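
The standard errors, the test statistic and the confidence interval can be put together as follows. This is a sketch in plain Python under stated assumptions: the function name `standard_errors` and the four data points are made up, and the critical value 4.303 is the $df=2$ value from the text, hard-coded here rather than recomputed.

```python
import math


def standard_errors(x, y):
    """Return (beta0_hat, beta1_hat, se_beta0, se_beta1) for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    # S is the square root of the residual mean square (n - 2 degrees of freedom).
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)
    se_b1 = s / math.sqrt(s_xx)
    return b0, b1, se_b0, se_b1


x = [1.0, 2.0, 3.0, 4.0]   # made-up data, n = 4 so df = 2
y = [1.2, 1.9, 3.2, 3.7]
b0, b1, se_b0, se_b1 = standard_errors(x, y)

t_crit = 4.303             # t(0.05) for df = 2, as computed in the text
t_b1 = (b1 - 0.0) / se_b1  # test statistic for H0: beta1 = 0
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(t_b1, ci_b1)
```

For these made-up points $|t|$ exceeds the critical value, so the 95% interval for $\beta_1$ excludes zero; the two statements are equivalent.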

### ANOVA (Analysis of variance) ###
- The purpose is to test the **null hypothesis: $\beta_1 = 0$** ($y =$ constant)

| Source |    SS (sum of squares)    | df (degrees of freedom)              | Mean Square | F |
|:------:|:--------------------------:| :--: | :--------: | :-: |
| Regression | $SS_{reg}= \sum_{i=1}^n (\hat y_i - \bar Y)^2$ | $1$ | $MS_{reg}=\frac{SS_{reg}}{1}$ | $\frac{MS_{reg}}{MS_{res}}$ |
| Residual |  $SS_{res}= \sum_{i=1}^n (y_i - \hat y_i)^2$     | $n-2$ | $MS_{res}=\frac{SS_{res}}{n-2}$ |   |
| Total    | $SS_{tot}=\sum_{i=1}^n ( y_i - \bar Y)^2 = SS_{reg}+SS_{res}$       | $n-1$  |       | |

- $F = \frac{MS_{reg}}{MS_{res}} = \frac{SS_{reg}}{1}/\frac{SS_{res}}{n-2}$ is the F statistic

- The F distribution is a right-skewed distribution used most commonly in Analysis of Variance. 
- When referencing the F distribution, the numerator degrees of freedom are always given first, so we use $F(1,n-2)$
- The F value tests the null hypothesis that the regression coefficient $\beta_1$ is zero, i.e., $y_i$ can be approximated by $\bar Y$.
- A large $F$ is evidence against the null hypothesis (i.e., evidence that $\beta_1$ is nonzero)
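
The entries of the ANOVA table can be computed as in the sketch below (plain Python; the function name `anova` and the data are illustrative assumptions). It also lets one confirm the decomposition $SS_{tot} = SS_{reg} + SS_{res}$ numerically.

```python
def anova(x, y):
    """Return the sums of squares and the F statistic for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    # F = MS_reg / MS_res with 1 and n - 2 degrees of freedom
    f_stat = (ss_reg / 1) / (ss_res / (n - 2))
    return {"ss_tot": ss_tot, "ss_reg": ss_reg, "ss_res": ss_res, "F": f_stat}


# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
table = anova(x, y)
print(table)
```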

__When to use F-test vs. t-test?__   
- They are equivalent when there is only one $x$ variable (simple linear regression)
- When there is more than one $x$ variable (multiple regression), the null hypothesis of the F-test is  
__All $\beta_i$ = 0 other than the intercept__.  
That is, none of the predictors are useful. 
- The t-test tests each $x$ variable individually. 

### How good is the fit? The $R^2$ value
- Coefficient of determination ($R^2$)
    - Total sum of squares:
    $SS_{tot}=\sum_{i=1}^{n} (y_i-\bar Y)^2$
    - Regression sum of squares:
    $SS_{reg}=\sum_{i=1}^{n} (\hat y_i- \bar Y)^2$
    - Sum of squares of residuals:
    $SS_{res}=\sum_{i=1}^{n} (\hat y_i- y_i)^2$
    - Coefficient of determination:
    $R^2 = 1-\frac{SS_{res}}{SS_{tot}} = 1-\frac{\sum_{i=1}^{n} (\hat y_i- y_i)^2}{\sum_{i=1}^{n} (y_i-\bar Y)^2}$
    - or, since $SS_{tot} = SS_{reg} + SS_{res}$, $R^2 = \frac{SS_{reg}}{SS_{tot}} = \frac{\sum_{i=1}^{n} (\hat y_i- \bar Y)^2}{\sum_{i=1}^{n} (y_i-\bar Y)^2}$
    - Close to 1 means a good fit
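
A short sketch showing that the two expressions for $R^2$ agree for an OLS fit. The function name `r_squared` and the data are illustrative assumptions.

```python
def r_squared(x, y):
    """Return R^2 computed two ways: 1 - SS_res/SS_tot and SS_reg/SS_tot."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return 1 - ss_res / ss_tot, ss_reg / ss_tot


# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r2_from_residuals, r2_from_regression = r_squared(x, y)
print(r2_from_residuals, r2_from_regression)
```

The agreement relies on the coefficients being the OLS estimates; for an arbitrary line the decomposition $SS_{tot} = SS_{reg} + SS_{res}$ need not hold.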

### A Small example ###
- $\{x_1,x_2,x_3,x_4\} = \{5,5,10,10\}$
- $\{y_1,y_2,y_3,y_4\} = \{14,17,27,22\}$
- $\bar X = (5+5+10+10)/4=7.5$
- $\bar Y = (14+17+27+22)/4=20$
- $\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar X)(y_i-\bar Y)}{\sum_{i=1}^{n}(x_i-\bar X)^2}
=\frac{(5-7.5)*(14-20)+(5-7.5)*(17-20)+(10-7.5)*(27-20)+(10-7.5)*(22-20)}
{(5-7.5)^2+(5-7.5)^2+(10-7.5)^2+(10-7.5)^2}=\frac{45}{25}=1.8$
- $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X = 20 - 1.8*7.5 = 6.5$
- $\hat y_i = \hat\beta_0 + \hat\beta_1*x_i = \{15.5, 15.5, 24.5, 24.5\}$

#### estimate of the variance ####
- **MSE**: $S^2 = \frac{\sum_{i=1}^{n} (y_i-\hat y_i)^2}{n-2}=
\frac{(14-15.5)^2+(17-15.5)^2+(27-24.5)^2+(22-24.5)^2}{4-2}=8.5$
- $S = \sqrt{8.5}=2.92$

####  confidence intervals ####
- **Standard errors**:

$$SE(\hat\beta_0) = S\left(\frac{1}{n}+\frac{\bar X^2}{\sum_{i=1}^{n}(x_i-\bar X)^2}\right)^{1/2}
= 2.92*(1/4+7.5^2/25)^{1/2}=4.61 $$
  
$SE(\hat\beta_1) = \frac{S}{\left(\sum_{i=1}^{n}(x_i-\bar X)^2\right)^{1/2}}
= 2.92/\sqrt{25}=0.58$
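- To find 95% confidence intervals for $\beta_0$ and $\beta_1$:

$t(0.05)_{df=4-2=2}=4.3$
    
$CI(\beta_0) = \hat\beta_0\pm t(0.05)*SE(\hat\beta_0)= 6.5\pm 4.3*4.61 = [-13.33,26.33]$

$CI(\beta_1) = \hat\beta_1\pm t(0.05)*SE(\hat\beta_1)=1.8\pm 4.3*0.58=[-0.71, 4.31]$

(The endpoints are computed from unrounded values of $t(0.05)$ and the standard errors, so they differ slightly from what the rounded numbers shown would give.)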

#### ANOVA ####
- Total sum of squares: $SS_{tot}=\sum_{i=1}^{n} (y_i-\bar Y)^2 = (14-20)^2+(17-20)^2+(27-20)^2+(22-20)^2 = 98$
- Regression sum of squares: $SS_{reg}=\sum_{i=1}^{n} (\hat y_i- \bar Y)^2
=(15.5-20)^2+(15.5-20)^2+(24.5-20)^2+(24.5-20)^2=81$
- Sum of squares of residuals: $SS_{res}=\sum_{i=1}^{n} (\hat y_i- y_i)^2
=(15.5-14)^2+(15.5-17)^2+(24.5-27)^2+(24.5-22)^2=17$

#### How good is the fit: $R^2$ ####
- $R^2 = 1-\frac{SS_{res}}{SS_{tot}}=1-\frac{17}{98}=0.83$


#### Hypothesis testing on the small example ####
- Suppose the null hypotheses are $\beta_0^0=0$ and $\beta_1^0=0$
- $t(\hat\beta_0) = \frac{\hat\beta_0-\beta_0^0}{SE(\hat\beta_0)}=(6.5-0)/4.61=1.41$. $p=0.29$. So the null hypothesis is not rejected at the 5% level ($\beta_0$ may be zero)
- $t(\hat\beta_1) = \frac{\hat\beta_1-\beta_1^0}{SE(\hat\beta_1)}=(1.8-0)/0.58=3.09$. $p=0.09$. So the null hypothesis is not rejected at the 5% level ($\beta_1$ may be zero)
- $F = \frac{SS_{reg}/1}{SS_{res}/(n-2)} = (81/1)/(17/(4-2))= 9.53.$ $p=0.09.$ The null hypothesis $\beta_1=0$ is not rejected at the 5% level.


In [None]:
df = 4 - 2
# Two-sided p-values for the two t statistics, then the upper-tail p-value for F(1, 2)
(2 * (1 - stats.t.cdf(1.41, df)),
 2 * (1 - stats.t.cdf(3.09, df)),
 1 - stats.f.cdf(9.53, 1, df))

Note that the p-values for the t-test on $\beta_1$ and the F-test are the same under simple linear regression.   
If $X \sim t(n-2)$, then $X^2 \sim F(1, n-2)$.
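
The identity behind this can be seen directly on the small example: $t^2 = \hat\beta_1^2 \sum_i (x_i-\bar X)^2 / S^2 = SS_{reg}/MS_{res} = F$. The sketch below recomputes both statistics in plain Python from the four data points above; the variable names are illustrative.

```python
import math

x = [5, 5, 10, 10]
y = [14, 17, 27, 22]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ms_res = ss_res / (n - 2)

# t statistic for H0: beta1 = 0, using SE(beta1_hat) = S / sqrt(S_xx)
t_b1 = b1 / (math.sqrt(ms_res) / math.sqrt(s_xx))
# F statistic from the ANOVA table
f_stat = (ss_reg / 1) / ms_res
print(t_b1 ** 2, f_stat)
```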

### Confirming our hand calculation with Python ###

In [1]:
import statsmodels.api as sm

xsmall = [5, 5, 10, 10]
ysmall = [14, 17, 27, 22]
xsmall2 = sm.add_constant(xsmall)
est_small = sm.OLS(ysmall, xsmall2).fit()
est_small.summary()



| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | y | R-squared: | 0.827 |
| Model: | OLS | Adj. R-squared: | 0.74 |
| Method: | Least Squares | F-statistic: | 9.529 |
| Date: | Mon, 12 Dec 2022 | Prob (F-statistic): | 0.0909 |
| Time: | 13:40:29 | Log-Likelihood: | -8.5696 |
| No. Observations: | 4 | AIC: | 21.14 |
| Df Residuals: | 2 | BIC: | 19.91 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | 6.5000 | 4.610 | 1.410 | 0.294 | -13.334 | 26.334 |
| x1 | 1.8000 | 0.583 | 3.087 | 0.091 | -0.709 | 4.309 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | | Durbin-Watson: | 2.059 |
| Prob(Omnibus): | | Jarque-Bera (JB): | 0.527 |
| Skew: | -0.0 | Prob(JB): | 0.768 |
| Kurtosis: | 1.221 | Cond. No. | 25.4 |


### Diagnosis: Examine the residual plot ###

- A plot of the residuals (y axis) against $X$ can show whether the relation is linear


- It can also show if $\epsilon_i$ are not iid, e.g., when their spread changes with $X$ (heteroskedasticity):

<img src="http://3.bp.blogspot.com/-yJygqvJgMK8/UJXjAZymkqI/AAAAAAAAE-Q/vEUKq95msSE/s1600/2012-04-heteroskedasticity_modelling.png" width="600px"></img>
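
What such a plot reveals can be imitated numerically. The sketch below, in plain Python, simulates data whose noise standard deviation grows with $x$ (a deliberate violation of the equal-variance assumption), fits the line by OLS, and compares the mean squared residual for small and large $x$. The function name `ols_residuals`, the coefficients, the noise scale and the split into halves are illustrative assumptions; this is a crude informal check, not a formal test.

```python
import random


def ols_residuals(x, y):
    """Fit y = b0 + b1 * x by OLS and return the residuals y_i - y_hat_i."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


rng = random.Random(0)
x = [float(i) for i in range(1, 201)]
# Heteroskedastic noise: the standard deviation is proportional to x.
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.05 * xi) for xi in x]
resid = ols_residuals(x, y)

# x is already sorted, so the two halves correspond to small and large x.
half = len(x) // 2
spread_low = sum(r ** 2 for r in resid[:half]) / half
spread_high = sum(r ** 2 for r in resid[half:]) / half
print(spread_low, spread_high)
```

A much larger spread for large $x$ corresponds to the fan shape seen in a residual plot of heteroskedastic data.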


### Confidence interval for a particular prediction ###
- The predicted value at $x_i$ is $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ 
- The interval for a new observation at $x_i$ is
$$\hat y_i \pm t_{n-2} S\sqrt{1+h_i}$$
- Here $h_i = \frac{1}{n}+\frac{(x_i-\bar X)^2}{\sum_{j=1}^n\left(x_j - \bar X \right)^2}$, so the further $x_i$ is from $\bar X$, the wider the interval
- $[-t_{n-2}, t_{n-2}]$ is the central 95% interval of the $t$-distribution with $df=n-2$
- When using a regression model to predict, one should not go beyond the range of observed data
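
A minimal sketch of this interval on the small example from earlier, in plain Python. The function name `prediction_interval` is an illustrative assumption, and the critical value 4.303 is the $df=2$ value quoted earlier, hard-coded rather than recomputed.

```python
import math


def prediction_interval(x, y, x0, t_crit):
    """Return (lower, upper) of y_hat(x0) +/- t_crit * S * sqrt(1 + h)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    # h grows with the squared distance of x0 from the mean of the observed x.
    h = 1 / n + (x0 - x_bar) ** 2 / s_xx
    y0_hat = b0 + b1 * x0
    half_width = t_crit * s * math.sqrt(1 + h)
    return y0_hat - half_width, y0_hat + half_width


x = [5, 5, 10, 10]
y = [14, 17, 27, 22]
t_crit = 4.303  # 97.5% quantile of the t-distribution with df = 2, as in the text
at_mean = prediction_interval(x, y, 7.5, t_crit)
at_edge = prediction_interval(x, y, 10.0, t_crit)
print(at_mean, at_edge)
```

The interval at $x = 10$ is wider than the one at $\bar X = 7.5$, which illustrates why predictions become less reliable away from the centre of the data.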
