# Chapter 3 

## Linear Regression

* Linear regression is a simple approach to supervised learning. It assumes that the dependence of $Y$ on $X_1, X_2, \ldots, X_p$ is linear.

* True regression functions are NEVER linear. (see video diagram https://www.youtube.com/watch?v=7TgVO_K75EY)

* Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

## Linear regression for advertising data
Consider the advertising data shown on the next slide.

Questions we might ask:
* Is there a relationship between advertising budget and sales?
* How strong is the relationship between advertising budget and sales?
* Which media contribute to sales?
* How accurately can we predict future sales?
* Is the relationship linear?
* Is there synergy among the advertising media?


# Simple linear regression using a single predictor $X$

* We assume a model $$Y=\beta_0+\beta_1 X+\epsilon,$$ where $\beta_0$ and $\beta_1$ are two unknown constants that represent the *intercept* and *slope*, also known as *coefficients* or *parameters*, and $\epsilon$ is the error term.

* Given some estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the model coefficients, we predict future sales using $$\hat{y}=\hat{\beta}_0 +\hat{\beta}_1x,$$ where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X=x$. The hat symbol denotes an estimated value.


## Estimation of the parameters by least squares
* Let $\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i=y_i-\hat{y}_i$ represents the $i$th residual.

* We define the residual sum of squares (RSS) as $$RSS=e_{1}^{2}+e_{2}^{2}+\cdots+e_{n}^{2},$$ or equivalently as $$RSS=(y_1-\hat{\beta}_0-\hat{\beta}_1 x_1)^2+(y_2-\hat{\beta}_0-\hat{\beta}_1 x_2)^2+\cdots+(y_n-\hat{\beta}_0-\hat{\beta}_1 x_n)^2.$$


* The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. The minimizing values can be shown to be $$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x},$$ where $\bar{y}\equiv\frac{1}{n}\sum_{i=1}^{n}y_i$ and $\bar{x}\equiv\frac{1}{n}\sum_{i=1}^{n}x_i$ are the sample means.
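
The closed-form least squares estimates can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data (the generating values 2 and 3 are made up, not taken from the Advertising data); the function name `simple_least_squares` is hypothetical. The slope is the ratio of the centered cross-products to the centered sum of squares of $x$, and the intercept follows as $\bar{y}-\hat{\beta}_1\bar{x}$.

```python
import numpy as np

# Synthetic data (hypothetical, not the Advertising data): y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)


def simple_least_squares(x, y):
    """Return (beta0_hat, beta1_hat), the values that minimize the RSS."""
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


beta0_hat, beta1_hat = simple_least_squares(x, y)

# Residuals e_i = y_i - y_hat_i and the residual sum of squares
residuals = y - (beta0_hat + beta1_hat * x)
rss = np.sum(residuals ** 2)
```

With an intercept in the model, the residuals sum to (numerically) zero, which is a quick sanity check on the fit.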

### Example: Advertising data (see slide)
The least squares fit for the regression of sales onto TV. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.

### Assessing the Accuracy of the Coefficient Estimates

* The standard error $SE$ of an estimator reflects how it varies under repeated sampling. We have $$SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad SE(\hat{\beta}_0)^2 = \sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right],$$ where $\sigma^2 =Var(\epsilon)$.

* These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form $$\hat{\beta}_1\pm2\cdot SE(\hat{\beta}_1).$$

### Confidence intervals - continued

That is, there is approximately a 95% chance that the interval $$\left[\hat{\beta}_1-2\cdot SE(\hat{\beta}_1) , \hat{\beta}_1+2 \cdot SE(\hat{\beta}_1)\right]$$ will contain the true value of $\beta_1$ (under a scenario where we got repeated samples like the present sample).

For the advertising data, the 95% confidence interval for $\beta_1$ is $[0.042,0.053].$
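
As a rough sketch of how these quantities are computed, the block below estimates the standard errors and an approximate 95% interval for the slope on synthetic data (the generating values are made up, not the Advertising data). Since $\sigma^2$ is unknown in practice, it is assumed here to be estimated by $RSS/(n-2)$; the intercept standard error uses the standard form $\sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_i (x_i-\bar{x})^2}\right]$.

```python
import numpy as np

# Synthetic data (hypothetical): true slope is 3
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=n)

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

# sigma^2 is unknown; assume it is estimated by RSS / (n - 2)
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
sigma2_hat = rss / (n - 2)

# Standard errors of the slope and the intercept
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / sxx))

# Approximate 95% confidence interval: estimate +/- 2 standard errors
ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
```

The factor 2 is the usual rounding of the exact quantile; for small $n$ a $t$ quantile with $n-2$ degrees of freedom would give a slightly wider interval.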

### Hypothesis testing
* Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis

    $H_0$: There is no relationship between $X$ and $Y$ ($\beta_1=0$)

    versus the alternative hypothesis

    $H_A$: There is some relationship between $X$ and $Y$ ($\beta_1\neq0$).



* Mathematically, this corresponds to testing $$H_0 : \beta_1=0$$ versus $$H_A : \beta_1\neq0,$$ since if $\beta_1=0$ then the model reduces to $Y = \beta_0 +\epsilon,$ and $X$ is not associated with $Y$.

* To test the null hypothesis, we compute a t-statistic, given by $$t=\frac{\hat{\beta}_1-0}{SE(\hat{\beta}_1)}.$$

* This will have a t-distribution with $n-2$ degrees of freedom, assuming $\beta_1 = 0$.

* Using statistical software, it is easy to compute the probability of observing any value equal to $|t|$ or larger in absolute value, assuming $\beta_1=0$. We call this probability the p-value.
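
The test can be sketched as follows on synthetic data (made-up generating values). One deliberate simplification: the p-value below uses a standard normal approximation to the $t$ distribution, which is only reasonable for large $n$; a statistics package would use the exact $t_{n-2}$ distribution. The function names are hypothetical.

```python
import math

import numpy as np


def two_sided_normal_p(t):
    """Approximate two-sided p-value P(|Z| >= |t|) for a standard normal Z.

    Assumption: the normal stands in for the t distribution (large n only).
    """
    return math.erfc(abs(t) / math.sqrt(2.0))


def slope_t_statistic(x, y):
    """t-statistic for H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
    beta0_hat = y.mean() - beta1_hat * x_bar
    rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
    se_beta1 = np.sqrt(rss / (n - 2) / sxx)
    return beta1_hat / se_beta1


# Synthetic data (hypothetical): a moderate true slope of 0.5
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(0, 2.0, size=100)

t_stat = slope_t_statistic(x, y)
p_value_approx = two_sided_normal_p(t_stat)
```

A large $|t|$ yields a small p-value, which is evidence against $H_0$.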

### Results for advertising data
![image.png](attachment:image.png)

The t-statistic and p-value show that TV advertising has a very strong effect on sales.

# Assessing the Overall Accuracy of the Model

* We compute the residual standard error $$RSE =\sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2},$$ where the residual sum of squares is $RSS =\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$.

* R-squared, or fraction of variance explained, is $$R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS},$$ where $TSS=\sum_{i=1}^{n}(y_i-\bar{y})^2$ is the total sum of squares.
* It can be shown that in this simple linear regression setting $R^2 =r^2$, where $r$ is the correlation between $X$ and $Y$: $$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}.$$
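
The identity $R^2=r^2$ is easy to check numerically. The sketch below uses synthetic data (made-up generating values) and computes the RSE, $R^2$, and the sample correlation directly from their definitions.

```python
import numpy as np

# Synthetic data (hypothetical values)
rng = np.random.default_rng(2)
n = 150
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.8 * x + rng.normal(0, 2.0, size=n)

# Least squares fit
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares (deviations from y-bar)
rse = np.sqrt(rss / (n - 2))         # residual standard error
r_squared = 1.0 - rss / tss

# Sample correlation between x and y (note the square roots in the denominator)
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
```

Here `r_squared` and `r ** 2` agree up to floating-point error, which holds only in the single-predictor setting.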

### Advertising data results 
![image.png](attachment:image.png)

# Multiple Linear Regression
*  Here our model is $$Y= \beta_0+\beta_1 X_1+\beta_2 X_2+...+\beta_p X_p+\epsilon$$

* We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed. In the advertising example, the model becomes $$sales=\beta_0+\beta_1\times TV+\beta_2\times radio+\beta_3\times newspaper+\epsilon.$$

### Interpreting regression coefficients 
* The ideal scenario is when the predictors are uncorrelated (a balanced design):
 - Each coefficient can be estimated and tested separately.
 - Interpretations such as "a unit change in $X_j$ is associated with a $\beta_j$ change in $Y$, while all the other variables stay fixed" are possible.
* Correlations amongst predictors cause problems:
 - The variance of all coefficients tends to increase, sometimes dramatically
 - Interpretations become hazardous - when $X_j$ changes, everything else changes. 
* Claims of causality should be avoided for observational data. 

### The woes of (interpreting) regression coefficients.  

##### "Data Analysis and Regression", Mosteller and Tukey, 1977
* A regression coefficient $\beta_j$ estimates the expected change in $Y$ per unit change in $X_j$, with all other predictors held fixed. But predictors usually change together!

* Example: $Y$ = total amount of change in your pocket; $X_1$ = # of coins; $X_2$ = # of pennies, nickels, and dimes. By itself, the regression coefficient of $Y$ on $X_2$ will be $>0$. But how about with $X_1$ in the model?

* $Y$ = number of tackles by a football player in a season; $W$ and $H$ are his weight and height. The fitted regression model is $\hat{Y} =\hat{\beta}_0+0.50W-0.10H$. How do we interpret $\hat{\beta}_2<0$?

### Two quotes by famous statisticians

"Essentially, all models are wrong, but some are useful." - George Box

"The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively." - Fred Mosteller and John Tukey, paraphrasing George Box

### Estimation and Prediction for Multiple Regression

* Given the estimates $\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2,\ldots,\hat{\beta}_p,$ we can make predictions using the formula $$ \hat{y}=\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2+\cdots+\hat{\beta}_px_p.$$

* We estimate $\beta_0,\beta_1,\beta_2,\ldots,\beta_p$ as the values that minimize the sum of squared residuals $$RSS=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_i-\hat{\beta}_0-\hat{\beta}_1x_{i1}-\hat{\beta}_2x_{i2}-\cdots-\hat{\beta}_px_{ip})^2.$$ This is done using standard statistical software. The values $\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2,\ldots,\hat{\beta}_p$ that minimize RSS are the multiple least squares regression coefficient estimates.
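
One way to carry out this minimization is with a linear least squares solver. The sketch below uses `numpy.linalg.lstsq` on synthetic data; the coefficient values in `true_beta` and the helper names `fit_least_squares` and `predict` are made up for illustration.

```python
import numpy as np

# Synthetic data (hypothetical): intercept 1.0 and three predictors
rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, 2.0, -1.0, 0.5])  # intercept first
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.5, size=n)


def fit_least_squares(X, y):
    """Return the coefficient vector (intercept first) that minimizes the RSS."""
    design = np.column_stack([np.ones(len(y)), X])  # prepend a column of ones
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta_hat


def predict(X_new, beta_hat):
    """y_hat = beta0_hat + beta1_hat * x1 + ... + betap_hat * xp."""
    return beta_hat[0] + X_new @ beta_hat[1:]


beta_hat = fit_least_squares(X, y)
residuals = y - predict(X, beta_hat)
rss = np.sum(residuals ** 2)
```

At the minimum, the residuals are orthogonal to every column of the design matrix, which gives a simple check that the solver found the least squares solution.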

![image.png](attachment:image.png)

### Some important questions

1. Is at least one of the predictors $X_1,X_2,\ldots,X_p$ useful in predicting the response?
2. Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction? 


For the first question, we can use the F-statistic $$F=\frac{(TSS-RSS)/p}{RSS/(n-p-1)}\sim F_{p,n-p-1}.$$
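
A small sketch of the F-statistic, computed directly from TSS and RSS on synthetic data (made-up values). The function name `f_statistic` is hypothetical. Two responses are compared: one that truly depends on the predictors and one that is pure noise.

```python
import numpy as np


def f_statistic(X, y):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1)) for a fit with an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / p) / (rss / (n - p - 1))


# Synthetic data (hypothetical)
rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
y_signal = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)
y_noise = rng.normal(size=n)  # unrelated to X

f_signal = f_statistic(X, y_signal)  # expected to be far above 1
f_noise = f_statistic(X, y_noise)    # expected to be near 1
```

When no predictor is related to the response, $F$ is expected to be close to 1; a value much larger than 1 suggests that at least one predictor is useful.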

![image.png](attachment:image.png)

### Deciding on the important variables 

* The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.

* However, we often can't examine all possible models, since there are $2^p$ of them; for example, when $p=40$ there are over a billion models! Instead we need an automated approach that searches through a subset of them. We discuss two commonly used approaches next.

## Forward Selection

* Begin with the *null model*: a model that contains an intercept but no predictors.

* Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS.

* Add to that model the variable that results in the lowest RSS amongst all two-variable models.

* Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold.
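
The forward selection steps above can be sketched as a greedy loop. This is a minimal illustration, not a production implementation: the stopping rule is assumed to be a fixed maximum number of variables (`max_vars`) rather than a p-value threshold, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def rss_for(X, y, columns):
    """RSS of the least squares fit using an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta_hat) ** 2))


def forward_selection(X, y, max_vars):
    """Greedily add the variable that lowers the RSS the most at each step.

    Returns the path of (selected columns, RSS), starting from the null model.
    Assumed stopping rule: stop once max_vars variables are in the model.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    path = [(tuple(selected), rss_for(X, y, selected))]
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: rss_for(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append((tuple(selected), rss_for(X, y, selected)))
    return path


# Synthetic data (hypothetical): only columns 2 and 0 matter
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] - 2.0 * X[:, 0] + rng.normal(size=200)

path = forward_selection(X, y, max_vars=2)
```

Each step fits at most $p$ models, so the whole path costs on the order of $p^2$ fits instead of $2^p$.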

## Backward Selection
* Start with all variables in the model.
* Remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
* The new $(p-1)$-variable model is fit, and the variable with the largest p-value is removed.
* Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a significant p-value defined by some significance threshold.
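
Backward selection can be sketched in the same spirit. Within a single fitted model every coefficient shares the same degrees of freedom, so the variable with the largest p-value is the one with the smallest $|t|$; the sketch below therefore ranks variables by $|t|$ and avoids computing p-values. The stopping threshold `t_threshold=2.0` (roughly a 5% level for large $n$) is an assumption, as are the function names and the synthetic data.

```python
import numpy as np


def t_statistics(X, y, columns):
    """t-statistics of the non-intercept coefficients for the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    sigma2_hat = rss / (n - design.shape[1])          # RSS / (n - p - 1)
    cov = sigma2_hat * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))
    return (beta_hat / se)[1:]                        # drop the intercept


def backward_selection(X, y, t_threshold=2.0):
    """Repeatedly drop the least significant variable (smallest |t|).

    Assumed stopping rule: stop when every remaining |t| >= t_threshold.
    """
    columns = list(range(X.shape[1]))
    while columns:
        abs_t = np.abs(t_statistics(X, y, columns))
        weakest = int(np.argmin(abs_t))
        if abs_t[weakest] >= t_threshold:
            break
        columns.pop(weakest)
    return columns


# Synthetic data (hypothetical): only columns 2 and 0 matter
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] - 2.0 * X[:, 0] + rng.normal(size=200)

kept = backward_selection(X, y)
```

Note that backward selection needs $n > p$ so that the full model can be fit, whereas forward selection does not.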

#### Model selection continued

* Later on we discuss more systematic criteria for choosing an "optimal" member in the path of models produced by forward or backward stepwise selection.

* These include Mallow's $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted $R^2$, and cross-validation (CV).

#### Other considerations in the Regression Model

##### Qualitative Predictors
* Some predictors are not quantitative but are qualitative, taking a discrete set of values.
* These are also called categorical predictors or factor variables.

* See, for example, the scatterplot matrix of the credit card data. In addition to the 7 quantitative variables shown, there are four qualitative variables: gender, student (student status), status (marital status), and ethnicity (Caucasian, African American (AA), or Asian).

 ![image.png](attachment:image.png)

Example: investigate differences in credit card balance between males and females, ignoring the other variables. We create a new variable:


![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Qualitative predictors with more than two levels

* With more than two levels, we create additional dummy variables. For example, for the ethnicity variable we create two dummy variables. The first could be

![image.png](attachment:image.png)

(Can't seem to replicate this format in native LaTeX; screenshots will have to do.)

### Qualitative predictors with more than two levels - continued. 

* Then both of these variables can be used in the regression equation in order to obtain the model

![image.png](attachment:image.png)

* There will always be one fewer dummy variable than the number of levels. The level with no dummy variable (AA in this example) is known as the baseline.
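
Dummy coding with a baseline level can be sketched as below. The helper `dummy_encode` and the five example values are hypothetical (they are not rows from the Credit data); the point is only that a variable with three levels produces two 0/1 columns, and the baseline level is the row of all zeros.

```python
import numpy as np


def dummy_encode(values, baseline):
    """Return (non-baseline levels, 0/1 matrix with one column per such level)."""
    levels = [lvl for lvl in sorted(set(values)) if lvl != baseline]
    columns = np.array(
        [[1.0 if v == lvl else 0.0 for lvl in levels] for v in values]
    )
    return levels, columns


# Hypothetical observations of a three-level variable
ethnicity = ["Asian", "Caucasian", "African American", "Asian", "African American"]
levels, dummies = dummy_encode(ethnicity, baseline="African American")

# levels  -> ["Asian", "Caucasian"]
# dummies -> one row per observation; baseline rows are [0, 0]
```

In a regression on these columns, each coefficient is read as the difference between that level and the baseline.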


![image.png](attachment:image.png)

# Extensions of the linear model

Removing the additive assumption: interactions and nonlinearity

Interactions:
* In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.
* For example, the linear model $$\widehat{sales} = \beta_0 + \beta_1 \times TV+ \beta_2 \times radio +\beta_3 \times newspaper$$ states that the average effect on sales of a one-unit increase in TV is always $\beta_1$, regardless of the amount spent on radio.

* But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.

* In this situation, given a fixed budget of 100,000 dollars, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or radio.

* In marketing, this is known as a *synergy* effect, and in statistics it is referred to as an *interaction* effect.

![image.png](attachment:image.png)

# Modeling interactions - Advertising data

The model takes the form $$sales=\beta_0+\beta_1\times TV+\beta_2\times radio+\beta_3\times(radio\times TV)+\epsilon$$ $$=\beta_0+(\beta_1+\beta_3\times radio)\times TV+\beta_2\times radio+\epsilon.$$ Results:
![image.png](attachment:image.png)

# Interpretation 
* The results in this table suggest that interactions are important.

* The p-value for the interaction term $TV\times radio$ is extremely low, indicating that there is strong evidence for $H_A:\beta_3\neq0$.

* The $R^2$ for the interaction model is $96.8$%, compared to only $89.7$% for the model that predicts $sales$ using $TV$ and $radio$ without an interaction term. 

* This means that $(96.8-89.7)/(100-89.7)=69$% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.

* The coefficient estimates in the table suggest that an increase in TV advertising of 1,000 dollars is associated with increased sales of $(\hat{\beta}_1+\hat{\beta}_3\times radio)\times 1000=19+1.1\times radio$ units.

* An increase in radio advertising of 1,000 dollars will be associated with an increase in sales of $(\hat{\beta}_2+\hat{\beta}_3\times TV)\times 1000=29+1.1\times TV$ units.
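
A sketch of how an interaction model is fit and read, using synthetic data. The coefficients used to generate `sales` are invented for illustration and are not the Advertising estimates; `fit_r2` and `tv_slope` are hypothetical helper names. The block also reproduces the arithmetic behind the 69% figure from the two $R^2$ values quoted in these notes.

```python
import numpy as np

# Synthetic data (hypothetical coefficients, chosen only for illustration)
rng = np.random.default_rng(7)
n = 300
tv = rng.uniform(0, 300, size=n)
radio = rng.uniform(0, 50, size=n)
sales = 6.0 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(0, 1, size=n)


def fit_r2(design, y):
    """Return (R^2, coefficient estimates) for a least squares fit."""
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss, beta_hat


ones = np.ones(n)
r2_additive, _ = fit_r2(np.column_stack([ones, tv, radio]), sales)
r2_interaction, beta_hat = fit_r2(np.column_stack([ones, tv, radio, tv * radio]), sales)

# Share of the variability left by the additive model that the interaction explains
share_explained = (r2_interaction - r2_additive) / (1.0 - r2_additive)

# Same calculation with the R^2 values quoted in the notes: about 0.69
share_from_notes = (96.8 - 89.7) / (100 - 89.7)


def tv_slope(radio_level):
    """Estimated effect of one more unit of TV at a given radio level."""
    return beta_hat[1] + beta_hat[3] * radio_level
```

Because the TV slope is $\hat{\beta}_1+\hat{\beta}_3\times radio$, `tv_slope` grows with the radio budget whenever the estimated interaction coefficient is positive.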

### Hierarchy 
* Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.

* The *hierarchy principle*: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

* The rationale for this principle is that interactions are hard to interpret in a model without main effects: their meaning is changed.

* Specifically, the interaction terms also contain main effects, if the model has no main effect terms.

### Interactions between qualitative and quantitative variables

Consider the Credit data set, and suppose that we wish to predict balance using income (quantitative) and student (qualitative). Without an interaction term, the model takes the form:
![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### See text Section 3.3.3 for information regarding outliers, non-constant variance of error terms, high-leverage points, and collinearity

## Generalizations of the Linear Model

In much of the rest of this course, we discuss methods that expand the scope of linear models and how they are fit:

* Classification problems: logistic regression, support vector machines.

* Non-linearity: kernel smoothing, splines and generalized additive models; nearest neighbor methods.

* Interactions: tree-based methods, bagging, random forests and boosting (these also capture non-linearities).

* Regularized fitting: Ridge regression and lasso.

# Chapter 4

## Classification

* Qualitative variables take values in an unordered set $C$, such as:

$\text{eye color} \in \{\text{brown}, \text{blue}, \text{green}\}$

$\text{email} \in \{\text{spam}, \text{ham}\}$

* Given a feature vector $X$ and a qualitative response $Y$ taking values in the set $C$, the classification task is to build a function $C(X)$ that takes as input the feature vector $X$ and predicts its value for $Y$; i.e. $C(X) \in C$.

* Often we are more interested in estimating the probabilities that $X$ belongs to each category in $C$. For example, it is more valuable to have an estimate of the probability that an insurance claim is fraudulent than a classification of fraudulent or not.
![image.png](attachment:image.png)

## Can we use Linear Regression?

* Suppose for the Default classification task that we code
![image.png](attachment:image.png)

* Can we simply perform a linear regression of $Y$ on $X$ and classify as Yes if $\hat{Y} >0.5$? 
    * In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to *linear discriminant analysis*, which we discuss later.
    * Since in the population $E(Y|X=x)=Pr(Y=1|X=x)$, we might think that regression is perfect for this task.
    * However, linear regression might produce probabilities less than zero or bigger than one. Logistic regression is more appropriate.
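
The out-of-range problem is easy to reproduce. The sketch below fits a straight line to a 0/1 response on a made-up one-dimensional predictor (not the Default data): thresholding the fitted values at 0.5 classifies well, yet the fitted values themselves fall below 0 and above 1, so they cannot be read as probabilities.

```python
import numpy as np

# Hypothetical predictor and a 0/1 response that switches at x = 5
x = np.linspace(0, 10, 101)
y = (x > 5).astype(float)

# Ordinary least squares fit of the 0/1 response on x
design = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta_hat

# Classify as "Yes" when the fitted value exceeds 0.5
predicted_yes = fitted > 0.5
accuracy = np.mean(predicted_yes == (y == 1.0))

# The fitted "probabilities" leave the [0, 1] range at both ends
lowest, highest = fitted.min(), fitted.max()
```

In this example the fitted line is negative near $x=0$ and exceeds one near $x=10$.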
    

    

![image.png](attachment:image.png)

Now suppose we have a variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms. 
![image.png](attachment:image.png)
This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.

Because of this, linear regression is not appropriate here. *Multiclass logistic regression* or *discriminant analysis* are more appropriate.

### Case-control sampling and logistic regression

* In the South African data, there are 160 cases and 302 controls, so $\tilde{\pi}=0.35$ are cases. Yet the prevalence of MI in this region is $\pi=0.05$.

* With case-control samples, we can estimate the regression parameters $\beta_j$ accurately (if our model is correct); the constant term $\beta_0$ is incorrect.

* We can correct the estimated intercept by a simple transformation:
$$\hat{\beta}_{0}^{\ast}=\hat{\beta}_{0}+\log{\frac{\pi}{1-\pi}}-\log{\frac{\tilde\pi}{1-\tilde\pi}}$$

* Often cases are rare and we take them all; up to five times that number of controls is sufficient.
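
The intercept correction is a one-line calculation. The sketch below applies it with the two proportions quoted above ($\pi=0.05$, $\tilde{\pi}=0.35$); the function name `corrected_intercept` is hypothetical, and the starting intercept of 0.0 is a placeholder used only to expose the size of the shift.

```python
import math


def corrected_intercept(beta0_hat, pi, pi_tilde):
    """Shift a case-control intercept to reflect the population prevalence.

    pi is the true prevalence; pi_tilde is the share of cases in the sample.
    """
    return (
        beta0_hat
        + math.log(pi / (1 - pi))
        - math.log(pi_tilde / (1 - pi_tilde))
    )


# Shift implied by pi = 0.05 and pi_tilde = 0.35 (placeholder intercept of 0.0)
shift = corrected_intercept(0.0, pi=0.05, pi_tilde=0.35)  # about -2.33
```

The shift is negative because cases are heavily over-represented in the sample, so the uncorrected model overstates the baseline risk. The slope coefficients are left unchanged.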

![image.png](attachment:image.png)

### Logistic regression with more than two classes

So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form $$Pr(Y=k|X) = \frac{e^{\beta_{0k}+\beta_{1k}X_1+\cdots+\beta_{pk}X_p}}{\sum_{l=1}^{K} e^{\beta_{0l}+\beta_{1l}X_1+\cdots+\beta_{pl}X_p}}$$

Here there is a linear function for each class.
(The mathier students will recognize that some cancellation is possible, and only $K-1$ linear functions are needed, as in 2-class logistic regression.)

Multiclass logistic regression is also referred to as multinomial regression.
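
The symmetric form and the cancellation remark can both be checked numerically. In this sketch the coefficient matrix `B` holds random made-up values (one column per class, intercepts in row 0); `class_probabilities` is a hypothetical helper. Subtracting class $K$'s coefficients from every class leaves the probabilities unchanged, which is why only $K-1$ linear functions are needed.

```python
import numpy as np


def class_probabilities(X, B):
    """Symmetric multiclass logistic probabilities.

    X: (n, p) feature matrix; B: (p + 1, K) coefficients, row 0 = intercepts.
    """
    scores = B[0] + X @ B[1:]                            # one linear function per class
    scores = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


# Hypothetical inputs: p = 2 predictors, K = 3 classes
rng = np.random.default_rng(9)
X = rng.normal(size=(4, 2))
B = rng.normal(size=(3, 3))

probs = class_probabilities(X, B)

# Cancellation: subtract the last class's coefficients from every class
B_reduced = B - B[:, [-1]]
probs_reduced = class_probabilities(X, B_reduced)
```

After the subtraction, the last column of `B_reduced` is all zeros, so class $K$ acts as a reference class.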

## Discriminant Analysis

Here the approach is to model the distribution of $X$ in each of the classes separately, and then use *Bayes theorem* to flip things around and obtain $Pr(Y|X)$.

When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.

However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions. 

### Bayes theorem for classification
Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem.

$$Pr(Y=k|X=x)=\frac{Pr(X=x|Y=k)*Pr(Y=k)}{Pr(X=x)}$$

One writes this slightly differently for discriminant analysis:
$$Pr(Y=k|X=x)=\frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^{K}\pi_{l}f_{l}(x)},$$ where

* $f_k(x)=Pr(X=x|Y=k)$ is the density for $X$ in class $k$. Here we will use normal densities for these, separately in each class.

* $\pi_k=Pr(Y=k)$ is the marginal or $prior$ probability for class $k$. 
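
A one-dimensional sketch of this formula with normal class densities. The priors, means, and standard deviations below are made up for illustration, and the function names are hypothetical; the posterior is simply each $\pi_k f_k(x)$ divided by their sum.

```python
import math


def normal_density(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior(x, priors, means, sds):
    """Pr(Y = k | X = x) for each class k, via Bayes theorem."""
    weighted = [
        pi_k * normal_density(x, mu_k, sd_k)
        for pi_k, mu_k, sd_k in zip(priors, means, sds)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical two-class setup with equal priors and equal variances
priors = [0.5, 0.5]
means = [-1.0, 1.0]
sds = [1.0, 1.0]

post_at_zero = posterior(0.0, priors, means, sds)  # midway between the means
post_at_two = posterior(2.0, priors, means, sds)   # much closer to class 2's mean
```

With equal priors and variances, the point midway between the two means is assigned probability one half to each class, and points to the right of it favor the second class.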

![image.png](attachment:image.png)

### Why discriminant analysis?

* When classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

* If $n$ is small and the distribution of the predictors $X$ is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.

* Linear discriminant analysis is popular when we have more than two response classes because it also provides low-dimensional views of the data.

# HEADS UP

 * The next few cells are a bunch of screenshots. 

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)