# Chapter 3 

## Linear Regression

* Linear regression is a simple approach to supervised learning. It assumes that the dependence of $Y$ on $X_1, X_2, \ldots, X_p$ is linear.

* True regression functions are NEVER linear. (see video diagram https://www.youtube.com/watch?v=7TgVO_K75EY)

* Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

## Linear regression for advertising data
Consider the advertising data shown on the next slide.
Questions we might ask:
* Is there a relationship between advertising budget and sales?
* How strong is the relationship between advertising budget and sales?
* Which media contribute to sales?
* How accurately can we predict future sales?
* Is the relationship linear?
* Is there synergy among the advertising media?


# Simple Linear regression using a single predictor X.

* We assume a model $Y=\beta_0+\beta_1 X+\epsilon,$ where $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and $\epsilon$ is the error term.

* Given some estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the model coefficients, we predict future sales using $$\hat{y}=\hat{\beta}_0 +\hat{\beta}_1x,$$ where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X=x$. The hat symbol denotes an estimated value.


## Estimation of the parameters by least squares
* Let $\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i=y_i-\hat{y}_i$ represents the $i$th residual.

* We define the residual sum of squares (RSS) as $$RSS=e_{1}^{2}+e_{2}^{2}+\cdots+e_{n}^{2},$$ or equivalently as $$RSS=(y_1-\hat{\beta}_0-\hat{\beta}_1 x_1)^2+(y_2-\hat{\beta}_0-\hat{\beta}_1 x_2)^2+\cdots+(y_n-\hat{\beta}_0-\hat{\beta}_1 x_n)^2.$$


* The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. The minimizing values can be shown to be $$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x},$$ where $\bar{y}\equiv\frac{1}{n}\sum_{i=1}^{n}y_i$ and $\bar{x}\equiv\frac{1}{n}\sum_{i=1}^{n}x_i$ are the sample means.
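The closed-form estimates can be checked numerically. The following is a minimal sketch in plain Python; the function name `fit_simple_ols` and the data are made up for illustration, and the intercept is computed with the standard companion formula $\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$.

```python
def fit_simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) that minimize the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Made-up data lying exactly on the line y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
beta0_hat, beta1_hat = fit_simple_ols(x, y)
print(beta0_hat, beta1_hat)
```

Because the toy data have no noise, the fit should recover the intercept 1 and slope 2 up to floating-point rounding.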

### Example: Advertising data (see slide)
The least squares fit for the regression of sales onto TV. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.

### Assessing the Accuracy of the Coefficient Estimates

* The standard error $SE$ of an estimator reflects how it varies under repeated sampling. We have $$SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad SE(\hat{\beta}_0)^2 = \sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right],$$ where $\sigma^2 =\mathrm{Var}(\epsilon)$.

* These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form $$\hat{\beta}_1\pm2\cdot SE(\hat{\beta}_1).$$

### Confidence intervals - continued

That is, there is approximately a 95% chance that the interval $$\left[\hat{\beta}_1-2\cdot SE(\hat{\beta}_1),\ \hat{\beta}_1+2 \cdot SE(\hat{\beta}_1)\right]$$ will contain the true value of $\beta_1$ (under a scenario where we obtain repeated samples like the present sample).

For the advertising data, the 95% confidence interval for $\beta_1$ is $[0.042,0.053].$
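As a rough illustration of the standard-error and interval formulas, the sketch below computes them on a small made-up data set (not the advertising data). It assumes $\sigma^2$ is unknown and estimates it by $RSS/(n-2)$; the function name `simple_ols_with_se` is hypothetical.

```python
import math


def simple_ols_with_se(x, y):
    """Return (beta0, beta1, se_beta0, se_beta1) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    # Assumption: sigma^2 is estimated from the residuals as RSS / (n - 2)
    sigma2_hat = rss / (n - 2)
    se_beta1 = math.sqrt(sigma2_hat / sxx)
    se_beta0 = math.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / sxx))
    return beta0, beta1, se_beta0, se_beta1


# Made-up data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
beta0, beta1, se_beta0, se_beta1 = simple_ols_with_se(x, y)

# Approximate 95% confidence interval for the slope: estimate +/- 2 * SE
ci_low = beta1 - 2 * se_beta1
ci_high = beta1 + 2 * se_beta1
print(beta1, se_beta1, (ci_low, ci_high))
```

With only five observations the "$\pm 2$ standard errors" rule is a crude approximation; it is used here only to mirror the formula in the notes.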

### Hypothesis testing
* Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis

    $H_0$: There is no relationship between $X$ and $Y$ ($\beta_1=0$)

    versus the alternative hypothesis

    $H_A$: There is some relationship between $X$ and $Y$ ($\beta_1\neq0$).



* Mathematically, this corresponds to testing $$H_0 : \beta_1=0$$ versus $$H_A : \beta_1\neq0,$$ since if $\beta_1=0$ then the model reduces to $Y = \beta_0 +\epsilon,$ and $X$ is not associated with $Y$.

* To test the null hypothesis, we compute a t-statistic, given by $$t=\frac{\hat{\beta}_1-0}{SE(\hat{\beta}_1)}.$$

* This will have a t-distribution with $n-2$ degrees of freedom, assuming $\beta_1 = 0$.

* Using statistical software, it is easy to compute the probability of observing any value equal to $|t|$ or larger in absolute value, assuming $\beta_1=0$. We call this probability the p-value.
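The t-statistic and its two-sided p-value can be sketched without a statistics library. The code below is an illustrative approximation, not how statistical software does it: it integrates the t density numerically with Simpson's rule. The function names and the example numbers (a slope of 0.6 with standard error $\sqrt{0.08}$ from $n=5$ points) are assumptions made for the example.

```python
import math


def t_pdf(u, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + u * u / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """Approximate P(|T| >= |t|) using Simpson's rule on [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| <= T <= |t|)
    return max(0.0, 1.0 - central_mass)


# Hypothetical numbers: slope estimate, its standard error, and n = 5 points
beta1_hat, se_beta1, n = 0.6, math.sqrt(0.08), 5
t_stat = (beta1_hat - 0) / se_beta1
p_value = two_sided_p_value(t_stat, df=n - 2)
print(t_stat, p_value)
```

Because the p-value is computed as one minus an integral, extremely small p-values are only resolved down to roughly the integration and rounding error; a dedicated statistics library is the better tool in practice.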

### Results for advertising data
![image.png](attachment:image.png)
The t-statistic and p-value show that TV advertising has a very strong effect on sales.

# Assessing the Overall Accuracy of the Model

* We compute the residual standard error $$RSE =\sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2},$$ where the residual sum of squares is $RSS =\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$.

* R-squared, or the fraction of variance explained, is $$R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS},$$ where $TSS=\sum_{i=1}^{n}(y_i-\bar{y})^2$ is the total sum of squares.
* It can be shown that in this simple linear regression setting $R^2 =r^2$, where $r$ is the correlation between $X$ and $Y$: $$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
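The identity $R^2=r^2$ is easy to verify numerically. The sketch below computes the RSE, $R^2$, and the sample correlation $r$ on made-up data; the function name `fit_quality` is hypothetical.

```python
import math


def fit_quality(x, y):
    """Return (rse, r_squared, r) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)  # this is the TSS
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))
    r_squared = 1 - rss / syy
    r = sxy / math.sqrt(sxx * syy)
    return rse, r_squared, r


# Made-up data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
rse, r_squared, r = fit_quality(x, y)
print(rse, r_squared, r ** 2)
```

On this toy data set $RSS=2.4$ and $TSS=6$, so both $R^2$ and $r^2$ come out to about 0.6.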

### Advertising data results 
![image.png](attachment:image.png)

# Multiple Linear Regression
*  Here our model is $$Y= \beta_0+\beta_1 X_1+\beta_2 X_2+...+\beta_p X_p+\epsilon$$

* We interpret $\beta_j$ as the average effect on $Y$ of a one unit increase in $X_j$, holding all other predictors fixed. In the advertising example, the model becomes $$sales=\beta_0+\beta_1\times TV+\beta_2\times Radio+\beta_3\times Newspaper+\epsilon$$

### Interpreting regression coefficients 
* The ideal scenario is when the predictors are uncorrelated - a balanced design:
 - Each coefficient can be estimated and tested separately.
 - Interpretations such as "a unit change in $X_j$ is associated with a $\beta_j$ change in $Y$, while all the other variables stay fixed" are possible.
* Correlations amongst predictors cause problems:
 - The variance of all coefficients tends to increase, sometimes dramatically
 - Interpretations become hazardous - when $X_j$ changes, everything else changes. 
* Claims of causality should be avoided for observational data. 

### The woes of (interpreting) regression coefficients.  

##### "Data Analysis and Regression" Mosteller and Tukey 1977
* A regression coefficient $\beta_j$ estimates the expected change in $Y$ per unit change in $X_j$, with all other predictors held fixed. But predictors usually change together!

* Example: $Y$ = total amount of change in your pocket; $X_1$ = # of coins; $X_2$ = # of pennies, nickels, and dimes. By itself, the regression coefficient of $Y$ on $X_2$ will be $>0$. But how about with $X_1$ in the model?

* $Y$ = number of tackles by a football player in a season; $W$ and $H$ are his weight and height. The fitted regression model is $\hat{Y} =\hat{\beta}_0+.50W-.10H$. How do we interpret $\hat{\beta}_2<0$?

### Two Quotes by Famous Statisticians

"Essentially, all models are wrong, but some are useful." - George Box

"The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively." - Fred Mosteller and John Tukey, paraphrasing George Box

### Estimation and Prediction for Multiple Regression

* Given the estimates $\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2,\ldots,\hat{\beta}_p,$ we can make predictions using the formula $$ \hat{y}=\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2+\cdots+\hat{\beta}_px_p.$$

* We estimate $\beta_0,\beta_1,\beta_2,\ldots,\beta_p$ as the values that minimize the sum of squared residuals $$RSS=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_i-\hat{\beta}_0-\hat{\beta}_1x_{i1}-\hat{\beta}_2x_{i2}-\cdots-\hat{\beta}_px_{ip})^2.$$ This is done using standard statistical software. The values $\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2,\ldots,\hat{\beta}_p$ that minimize RSS are the multiple least squares regression coefficient estimates.
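One way to carry out this minimization by hand is to solve the normal equations $(X^TX)\hat{\beta}=X^Ty$, where $X$ includes a column of ones for the intercept. The sketch below does this in plain Python with Gaussian elimination; it is an illustration on made-up data, the function names are hypothetical, and statistical software generally relies on more numerically stable methods.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_multiple_ols(X, y):
    """X is a list of rows [x_i1, ..., x_ip]; returns [b0, b1, ..., bp]."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    return solve_linear_system(XtX, Xty)


# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
y = [1 + 2 * a - 3 * b for a, b in X]
coef = fit_multiple_ols(X, y)
print(coef)
```

Since the toy response is noise-free, the estimates should match the generating coefficients (1, 2, -3) up to rounding.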

![image.png](attachment:image.png)

### Some important questions

1. Is at least one of the predictors $X_1,X_2,\ldots,X_p$ useful in predicting the response?
2. Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction? 


For the first question, we can use the F-statistic $$F=\frac{(TSS-RSS)/p}{RSS/(n-p-1)}\sim F_{p,n-p-1}$$
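The F-statistic is a direct function of $TSS$, $RSS$, $n$, and $p$. A minimal sketch, with a hypothetical helper name and made-up numbers ($TSS=6$, $RSS=2.4$, $n=5$, $p=1$):

```python
def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))."""
    return ((tss - rss) / p) / (rss / (n - p - 1))


# Made-up values for a model with a single predictor
F = f_statistic(tss=6.0, rss=2.4, n=5, p=1)
print(F)
```

With a single predictor the F-statistic equals the square of the slope's t-statistic, so for these numbers $F=4.5$ corresponds to $|t|\approx 2.12$.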

![image.png](attachment:image.png)

### Deciding on the important variables 

* The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.

* However, we often can't examine all possible models, since there are $2^p$ of them; for example, when $p=40$ there are over a billion models (in fact $2^{40}\approx 1.1\times 10^{12}$)! Instead we need an automated approach that searches through a subset of them. We discuss two commonly used approaches next.
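To make the $2^p$ count concrete, the sketch below enumerates every subset of three predictors (names borrowed from the advertising example) and computes the number of candidate models for $p=40$. The helper `all_subsets` is a hypothetical name.

```python
from itertools import combinations


def all_subsets(predictors):
    """Yield every subset of the predictors, including the empty (null) model."""
    for size in range(len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


subsets = list(all_subsets(["TV", "Radio", "Newspaper"]))
print(len(subsets))   # 2**3 candidate models

n_models_p40 = 2 ** 40
print(n_models_p40)   # number of candidate models when p = 40
```

Enumeration is fine for a handful of predictors, but the count doubles with every added variable, which is why stepwise searches are used instead.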

## Forward Selection

* Begin with the *null model* - a model that contains an intercept but no predictors.

* Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS.

* Add to that model the variable that results in the lowest RSS amongst all two-variable models.

* Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold.
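The forward selection steps can be sketched as a greedy loop. The code below is a simplified illustration: it stops after a fixed number of variables (`max_vars`, an assumption standing in for the p-value rule), uses made-up data in which only columns 2 and 0 matter, and all function names are hypothetical.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def rss_for(X, y, cols):
    """RSS of the least squares fit using an intercept plus the given columns."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    k = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def forward_selection(X, y, max_vars):
    """Greedily add the variable giving the lowest RSS at each step."""
    selected, remaining, path = [], list(range(len(X[0]))), []
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda c: rss_for(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
        path.append((list(selected), rss_for(X, y, selected)))
    return path


# Made-up data: y depends strongly on column 2, weakly on column 0, not on column 1
x0 = [1, -1, 1, -1, 1, -1, 1, -1]
x1 = [1, 1, -1, -1, 1, 1, -1, -1]
x2 = [1, 1, 1, 1, -1, -1, -1, -1]
noise = [0.1, -0.1, 0.05, -0.05, -0.1, 0.1, -0.05, 0.05]
X = [list(r) for r in zip(x0, x1, x2)]
y = [5 * c + a + e for (a, b, c), e in zip(X, noise)]

path = forward_selection(X, y, max_vars=2)
for cols, rss in path:
    print(cols, rss)
```

On this toy data the path adds column 2 first and column 0 second, with the RSS dropping at each step.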

## Backward Selection
* Start with all variables in the model.
* Remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
* The new $(p-1)$-variable model is fit, and the variable with the largest p-value is removed.
* Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a significant p-value defined by some significance threshold.
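Backward selection can be sketched in a similar way. Within one fitted model all coefficients share the same degrees of freedom, so the variable with the largest p-value is the one with the smallest $|t|$; the sketch below uses that fact and stops when every remaining $|t|$ is at least 2 (an assumed threshold, roughly a 5% significance level for moderate $n$). The data and function names are made up for illustration.

```python
import math


def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def t_statistics(X, y, cols):
    """Fit intercept + given columns; return {column index: t-statistic}."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    n, k = len(rows), len(rows[0])  # k counts the intercept too
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(XtX)
    beta = [sum(inv[a][b] * Xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    sigma2 = rss / (n - k)  # assumes the fit is not perfect (rss > 0)
    return {c: beta[j + 1] / math.sqrt(sigma2 * inv[j + 1][j + 1])
            for j, c in enumerate(cols)}


def backward_selection(X, y, t_threshold=2.0):
    """Drop the least significant variable until all |t| >= t_threshold."""
    cols = list(range(len(X[0])))
    while cols:
        t = t_statistics(X, y, cols)
        weakest = min(cols, key=lambda c: abs(t[c]))
        if abs(t[weakest]) >= t_threshold:
            break
        cols.remove(weakest)
    return cols


# Made-up data: column 1 has no relationship with y
x0 = [1, -1, 1, -1, 1, -1, 1, -1]
x1 = [1, 1, -1, -1, 1, 1, -1, -1]
x2 = [1, 1, 1, 1, -1, -1, -1, -1]
noise = [0.1, -0.1, 0.05, -0.05, -0.1, 0.1, -0.05, 0.05]
X = [list(r) for r in zip(x0, x1, x2)]
y = [5 * c + a + e for (a, b, c), e in zip(X, noise)]

kept = backward_selection(X, y)
print(kept)
```

On this toy data the irrelevant column 1 is removed and columns 0 and 2 are kept.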

#### Model selection continued

* Later on we discuss more systematic criteria for choosing an "optimal" member in the path of models produced by forward or backward stepwise selection.

* These include Mallow's $C_p$, Akaike information criterion $(AIC)$, Bayesian information criterion $(BIC)$, adjusted $R^2$ and cross-validation $(CV)$.

#### Other considerations in the Regression Model

##### Qualitative Predictors
* Some predictors are not quantitative but are qualitative, taking a discrete set of values.
* These are also called categorical predictors or factor variables.

* See for example the scatterplot matrix of the credit card data. In addition to the 7 quantitative variables shown, there are four qualitative variables: gender, student (student status), status (marital status), and ethnicity (Caucasian, African American (AA) or Asian).

 ![image.png](attachment:image.png)

Example: investigate differences in credit card balance between males and females, ignoring the other variables. We create a new variable:


![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Qualitative predictors with more than two levels

* With more than two levels, we create additional dummy variables. For example, for the ethnicity variable we create two dummy variables. The first could be

![image.png](attachment:image.png)

(I can't seem to replicate this format in native LaTeX, so screenshots will have to do.)

### Qualitative predictors with more than two levels - continued. 

* Then both of these variables can be used in the regression equation in order to obtain the model

![image.png](attachment:image.png)

* There will always be one fewer dummy variable than the number of levels. The level with no dummy variable (AA in this example) is known as the baseline.
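The dummy-variable coding can be sketched in a few lines. The helper below is hypothetical and the sample values are made up; it creates one 0/1 column per non-baseline level, so a three-level variable yields two dummy variables and the baseline level is coded as all zeros.

```python
def dummy_encode(values, baseline):
    """Return (levels, rows): one 0/1 column per level other than the baseline."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Made-up sample of the ethnicity variable
ethnicity = ["Asian", "Caucasian", "African American", "Asian"]
levels, rows = dummy_encode(ethnicity, baseline="African American")
print(levels)  # the levels that receive a dummy variable
for value, row in zip(ethnicity, rows):
    print(value, row)
```

Each regression coefficient on a dummy column is then interpreted as the difference between that level and the baseline.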


![image.png](attachment:image.png)