# Chapter 3 

## Linear Regression

* Linear regression is a simple approach to supervised learning. It assumes that the dependence of $Y$ on $X_1, X_2, \ldots, X_p$ is linear.

* True regression functions are NEVER linear (see the diagram in the video: https://www.youtube.com/watch?v=7TgVO_K75EY).

* Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

## Linear regression for advertising data
Consider the advertising data shown on the next slide.
Questions we might ask:
* Is there a relationship between advertising budget and sales?
* How strong is the relationship between advertising budget and sales?
* Which media contribute to sales?
* How accurately can we predict future sales?
* Is the relationship linear?
* Is there synergy among the advertising media?


# Simple linear regression using a single predictor X

* We assume a model $Y=\beta_0+\beta_1 X+\epsilon$, where $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and $\epsilon$ is the error term.

* Given some estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the model coefficients, we predict future sales using $$\hat{y}=\hat{\beta}_0 +\hat{\beta}_1x,$$ where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X=x$. The hat symbol denotes an estimated value.


## Estimation of the parameters by least squares
* Let $\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i=y_i-\hat{y}_i$ represents the $i$th residual.

* We define the residual sum of squares (RSS) as $$RSS=e_{1}^{2}+e_{2}^{2}+\cdots+e_{n}^{2},$$ or equivalently as $$RSS=(y_1-\hat{\beta}_0-\hat{\beta}_1 x_1)^2+(y_2-\hat{\beta}_0-\hat{\beta}_1 x_2)^2+\cdots+(y_n-\hat{\beta}_0-\hat{\beta}_1 x_n)^2.$$


* The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. The minimizing values can be shown to be $$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x},$$ where $\bar{y}\equiv\frac{1}{n}\sum_{i=1}^{n}y_i$ and $\bar{x}\equiv\frac{1}{n}\sum_{i=1}^{n}x_i$ are the sample means.
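
The closed-form least squares estimates can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's own code: the function names `least_squares_fit` and `residual_sum_of_squares` are hypothetical, and the five data points are made up for the example (they are not the Advertising data).

```python
def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) that minimize the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope estimate.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def residual_sum_of_squares(x, y, beta0, beta1):
    """RSS for the line y = beta0 + beta1 * x."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))


# Made-up toy data, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat = least_squares_fit(x, y)
rss_min = residual_sum_of_squares(x, y, beta0_hat, beta1_hat)

# Moving the slope away from the estimate should not lower the RSS.
rss_nudged = residual_sum_of_squares(x, y, beta0_hat, beta1_hat + 0.1)

print("intercept:", beta0_hat, "slope:", beta1_hat)
print("RSS at estimate:", rss_min, "RSS with nudged slope:", rss_nudged)
```

Comparing `rss_min` with `rss_nudged` is a quick sanity check that the estimates sit at the minimum of the RSS.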

### Example: Advertising data (see slide)
The slide shows the least squares fit for the regression of sales onto TV. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.

### Assessing the Accuracy of the Coefficient Estimates

* The standard error $SE$ of an estimator reflects how it varies under repeated sampling. We have $$SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad SE(\hat{\beta}_0)^2 = \sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right],$$ where $\sigma^2 = Var(\epsilon)$.

* These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form $$\hat{\beta}_1\pm2\cdot SE(\hat{\beta}_1).$$

### Confidence intervals - continued

That is, there is approximately a 95% chance that the interval $$\left[\hat{\beta}_1-2\cdot SE(\hat{\beta}_1) , \hat{\beta}_1+2 \cdot SE(\hat{\beta}_1)\right]$$ will contain the true value of $\beta_1$ (under a scenario where we obtained repeated samples like the present sample).

For the advertising data, the 95% confidence interval for $\beta_1$ is $[0.042,0.053].$
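
A rough sketch of the standard error and confidence interval computation follows. Two assumptions are worth flagging: $\sigma^2$ is unknown in practice, so the sketch estimates it by $RSS/(n-2)$, and the intercept uses the standard formula $SE(\hat{\beta}_0)^2=\sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right]$. The data are made up for illustration and are not the Advertising data.

```python
import math

# Made-up toy data, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Assumption: Var(epsilon) is estimated from the residuals as RSS / (n - 2).
rss_value = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss_value / (n - 2)

# Standard errors of the slope and intercept estimates.
se_beta1 = math.sqrt(sigma2_hat / s_xx)
se_beta0 = math.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / s_xx))

# Approximate 95% confidence interval: estimate +/- 2 standard errors.
ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)

print("SE(beta1_hat):", se_beta1)
print("SE(beta0_hat):", se_beta0)
print("approx. 95% CI for beta1:", ci_beta1)
```

The multiplier 2 is the same approximation used in the interval above; with very few observations, as in this toy example, an exact interval would use a t-distribution quantile instead.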

### Hypothesis testing
* Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis

    $H_0$: There is no relationship between $X$ and $Y$ ($\beta_1=0$)

    versus the alternative hypothesis

    $H_A$: There is some relationship between $X$ and $Y$ ($\beta_1\neq0$).



* Mathematically, this corresponds to testing $$H_0 : \beta_1=0$$ versus $$H_A : \beta_1\neq0,$$ since if $\beta_1=0$ then the model reduces to $Y = \beta_0 +\epsilon,$ and $X$ is not associated with $Y$.

* To test the null hypothesis, we compute a t-statistic, given by $$t=\frac{\hat{\beta}_1-0}{SE(\hat{\beta}_1)}.$$

* This will have a t-distribution with $n-2$ degrees of freedom, assuming $\beta_1 = 0$.

* Using statistical software, it is easy to compute the probability of observing any value equal to $|t|$ or larger in absolute value, assuming $\beta_1 = 0$. We call this probability the p-value.
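
The t-statistic and its two-sided p-value can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the helper names `t_pdf` and `two_sided_p_value` are hypothetical, the p-value is obtained by numerically integrating the t density with Simpson's rule rather than by calling a statistics package, $\sigma^2$ is estimated by $RSS/(n-2)$, and the data are made up.

```python
import math


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p_value(t_stat, df, steps=10_000):
    """P(|T| >= |t_stat|) for T ~ t(df); steps must be even (Simpson's rule)."""
    b = abs(t_stat)
    if b == 0.0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    half_area = total * h / 3  # integral of the density from 0 to |t_stat|
    return max(0.0, 1.0 - 2.0 * half_area)


# Made-up toy data, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Assumption: Var(epsilon) is estimated by RSS / (n - 2).
rss_value = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
se_beta1 = math.sqrt(rss_value / (n - 2) / s_xx)

t_stat = (beta1_hat - 0.0) / se_beta1
p_value = two_sided_p_value(t_stat, df=n - 2)
print("t-statistic:", t_stat, "p-value:", p_value)
```

In practice a statistics library would normally do this step, for example `scipy.stats.t.sf` in SciPy; the hand-rolled integration is only meant to make the definition of the p-value concrete.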

### Results for advertising data
![image.png](attachment:image.png)
The t-statistic and p-value show that TV advertising has a very strong effect on sales.

# Assessing the Overall Accuracy of the Model

* We compute the Residual Standard Error $$RSE =\sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2},$$ where the residual sum-of-squares is $RSS =\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$.

* R-squared or fraction of variance explained is $$R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS},$$ where $TSS=\sum_{i=1}^{n}(y_i-\bar{y})^2$ is the total sum of squares.
* It can be shown that in this simple linear regression setting $R^2 =r^2$, where $r$ is the correlation between $X$ and $Y$: $$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}.$$
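
As a quick numerical check of these quantities, the sketch below computes RSE, $R^2$, and the sample correlation $r=\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i}(x_i-\bar{x})^2}\sqrt{\sum_{i}(y_i-\bar{y})^2}}$, then compares $R^2$ with $r^2$. The data are made up for illustration (not the Advertising data), and the variable names are arbitrary.

```python
import math

# Made-up toy data, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares fit and fitted values.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = [beta0_hat + beta1_hat * xi for xi in x]

# Residual and total sums of squares.
rss_value = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
tss_value = s_yy  # TSS is the sum of squared deviations of y from its mean

rse = math.sqrt(rss_value / (n - 2))
r_squared = 1.0 - rss_value / tss_value

# Sample correlation between x and y.
r = s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

print("RSE:", rse)
print("R^2:", r_squared, "r^2:", r ** 2)
```

The two printed values on the last line should agree up to floating-point rounding, which illustrates the identity $R^2=r^2$ for simple linear regression.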

### Advertising data results 
![image.png](attachment:image.png)