# Checking the Assumptions
Now that we have established the nature of testing assumptions visually, and introduced the core assumptions made by a linear model, we can start investigating *how* to check those assumptions. Before getting there, we first need to discuss some data features that are not assumptions per se, but which can point to invalid assumptions or unduly influence the model fit. We will then look at information we can extract from the model to help us assess both whether these data features are present and whether the model assumptions hold. Finally, we will examine a standard selection of plots that can be used for diagnostic purposes.

## Data Features that May Invalidate the Model

### Outliers
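Outliers are observations that sit unusually far from the values the model predicts for them. They can distort the coefficient estimates, inflate the residual variance, and affect inference.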

### High Leverage Points 
High leverage points are data points with unusual predictor values, which can have an outsized influence on the fit.

### Multicollinearity
*Perfect* multicollinearity would cause the model estimation to *fail*, because the coefficients are no longer uniquely determined. So, in `R`, when two predictors are perfectly correlated, one of them will have all its associated estimates set to `NA`, as we can see below.

In [8]:
data(mtcars)
wt.copy <- mtcars$wt
mod     <- lm(mpg ~ wt + wt.copy, data=mtcars)
summary(mod)


Call:
lm(formula = mpg ~ wt + wt.copy, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
wt.copy           NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10


We can see that the output includes a note saying that one coefficient is `not defined because of singularities`. This is an obvious problem that has been dealt with gracefully. The more insidious issue is when we have multicollinearity that is *high*, rather than *perfect*. We can generate a predictor with a high (but not perfect) correlation by simply adding some random noise to the copy of the `wt` variable, like so:

In [8]:
set.seed(666)
data(mtcars)
wt      <- mtcars$wt
wt.copy <- wt + rnorm(n=length(wt), mean=0, sd=0.2)
print(cor(wt,wt.copy))

[1] 0.9734991


So, we can see that `wt` and `wt.copy` have a correlation of $r = 0.97$, which would be considered very high. Now let us see what happens when both are in the same model.

In [14]:
mod.multicol <- lm(mpg ~ wt + wt.copy, data=mtcars)
summary(mod.multicol)


Call:
lm(formula = mpg ~ wt + wt.copy, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2881 -2.4767 -0.0536  1.6162  6.6278 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.561      1.938  19.377   <2e-16 ***
wt            -6.966      2.467  -2.824   0.0085 ** 
wt.copy        1.559      2.309   0.675   0.5049    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.074 on 29 degrees of freedom
Multiple R-squared:  0.7567,	Adjusted R-squared:  0.7399 
F-statistic: 45.09 on 2 and 29 DF,  p-value: 1.259e-09


Notice that the standard error of `wt` has *increased* dramatically from 0.559 to 2.467. This is more than a *four-fold* increase in uncertainty, with the $t$-statistic shrinking to a little under *one third* of its original value, going from -9.559 to -2.824. This is often referred to as the standard errors *blowing up*, due to the increased uncertainty that the correlation introduces. We will discuss ways of diagnosing this below and then will discuss some "solutions" later in the lesson.
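
The comparison described above can be reproduced in one self-contained script. This is a minimal sketch: the names `dat`, `mod.single`, and `se.ratio` are introduced here for illustration only, and the `wt`-only model stands in for the perfectly collinear fit, which reduces to the same thing once `wt.copy` is dropped.

```r
# Rebuild the noisy copy of wt exactly as in the cell above
data(mtcars)
set.seed(666)
dat         <- mtcars[, c("mpg", "wt")]
dat$wt.copy <- dat$wt + rnorm(n = nrow(dat), mean = 0, sd = 0.2)

# Model without and with the highly correlated predictor
mod.single   <- lm(mpg ~ wt, data = dat)            # hypothetical name
mod.multicol <- lm(mpg ~ wt + wt.copy, data = dat)

# Pull the standard error of wt out of each coefficient table
se.single   <- coef(summary(mod.single))["wt", "Std. Error"]
se.multicol <- coef(summary(mod.multicol))["wt", "Std. Error"]

# Ratio of the two standard errors: how much the uncertainty has grown
se.ratio <- se.multicol / se.single
print(c(single = se.single, multicol = se.multicol, ratio = se.ratio))
```

Working with the ratio rather than the raw values makes the size of the inflation easy to read off directly.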

## Model Outputs for Checking Assumptions
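
As a rough orientation for the quantities listed under this heading, the sketch below pulls each of them out of a fitted model using base `R` accessor functions. The simple `mpg ~ wt` model and the names `diagnostics`, `high.lev`, and `large.res` are assumptions made for illustration. The cut-offs used for flagging (twice the average leverage, and an absolute standardised residual above 2) are common rules of thumb rather than anything prescribed in this lesson.

```r
data(mtcars)
mod <- lm(mpg ~ wt, data = mtcars)

# Collect the model outputs used for diagnostics into one table
diagnostics <- data.frame(
  fitted    = fitted(mod),      # predicted values
  resid     = residuals(mod),   # raw residuals
  std.resid = rstandard(mod),   # standardised residuals
  leverage  = hatvalues(mod),   # leverage values (diagonal of the hat matrix)
  row.names = rownames(mtcars)
)

n <- nrow(mtcars)
p <- length(coef(mod))          # number of estimated coefficients

# Rule-of-thumb flags (assumed thresholds, not definitive cut-offs)
high.lev  <- diagnostics$leverage > 2 * p / n
large.res <- abs(diagnostics$std.resid) > 2

print(head(diagnostics))
print(rownames(diagnostics)[high.lev])
print(rownames(diagnostics)[large.res])
```

The leverage values always sum to the number of estimated coefficients, so `p / n` is the average leverage, which is why the flag is expressed as a multiple of it.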

### Residuals

### Standardised Residuals

### Predicted Values

### Leverage Values

### The Variance Inflation Factor (VIF)


In [13]:
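library(car)
# Refit with hp included so that the model matches the three predictors in the output
mod.multicol.hp <- lm(mpg ~ wt + wt.copy + hp, data=mtcars)
print(vif(mod.multicol.hp))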

       wt   wt.copy        hp 
21.917371 19.766097  1.826262 


As a rough guide, we can take:

- VIF = 1: No multicollinearity.
- 1 < VIF < 5: Moderate multicollinearity. Some caution may be needed, but this is not necessarily a problem for the model.
- VIF > 5: Very high multicollinearity. Almost certainly an issue, and some mitigation will be needed.
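
To see where these numbers come from, the VIF for a predictor can be computed by hand as $1 / (1 - R^{2}_{j})$, where $R^{2}_{j}$ is the $R^{2}$ from regressing that predictor on all the other predictors. The sketch below does this in base `R` without the `car` package. The helper `vif.manual` is a hypothetical name, and the three-predictor set (`wt`, `wt.copy`, `hp`) is an assumption chosen to mirror the predictors named in the printed output above.

```r
# Recreate the noisy copy of wt used earlier in this section
data(mtcars)
set.seed(666)
dat         <- mtcars[, c("mpg", "wt", "hp")]
dat$wt.copy <- dat$wt + rnorm(n = nrow(dat), mean = 0, sd = 0.2)

# Hypothetical helper: VIF of one predictor given the others
vif.manual <- function(predictor, others, data) {
  # Regress the chosen predictor on the remaining predictors
  f  <- reformulate(others, response = predictor)
  r2 <- summary(lm(f, data = data))$r.squared
  1 / (1 - r2)
}

predictors <- c("wt", "wt.copy", "hp")
vifs <- sapply(predictors, function(p) {
  vif.manual(p, setdiff(predictors, p), dat)
})
print(vifs)
```

A VIF can never fall below 1, and its square root gives the factor by which the standard error is inflated relative to the uncorrelated case.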

## Standard Diagnostic Plots

### Residual vs Fitted Plot

### Q-Q Normal Plot

### Scale vs Location Plot

### Residuals vs Leverage Plot
Cook's distance...
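
The four plots named in this section correspond to the default panels that `plot()` produces for a fitted `lm` object. As a minimal sketch, assuming a simple `mpg ~ wt` model purely for illustration, they can be drawn together on one page. Cook's distance, which appears as contours on the Residuals vs Leverage panel, can also be extracted directly with `cooks.distance()`.

```r
data(mtcars)
mod <- lm(mpg ~ wt, data = mtcars)

# Draw the four default diagnostic plots in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(mod)
par(op)   # restore the previous plotting layout

# Cook's distance for every observation, largest first
cooks <- cooks.distance(mod)
print(head(sort(cooks, decreasing = TRUE), 3))
```

Sorting the distances is only a convenience for spotting the most influential observations. It does not by itself say whether any of them is problematic.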

`````{admonition} Residuals are Not Independent with Constant Variance
:class: tip
One of the main reasons for distinguishing between *errors* and *residuals* is that the estimation process *changes* the distributional properties of the errors. This means that *errors* and *residuals* are not expected to behave identically. So while it is correct to assume

$$
\epsilon_{i} \overset{\text{i.i.d.}}{\sim} \mathcal{N}\left(0,\sigma^{2}\right),
$$

it is *not* technically correct to assume the same for the *residuals*. This is because the estimation procedure can *induce* correlation between the residuals, and the residuals can have non-constant variance, depending upon a property known as *leverage*. We will discuss some of these concepts next week. For now, just note that the residuals can be used as an *approximation* for the errors, but we need to perform some additional checks to make sure that this approximation is reasonable.
`````
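
The point made in the note above can be sketched with a short derivation. It assumes the usual matrix form of the model, $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, and the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}$, neither of which is written out in this section. Here $\mathbf{e}$ denotes the vector of residuals and $h_{ij}$ the elements of $\mathbf{H}$, with the diagonal elements $h_{ii}$ being the leverage values.

$$
\begin{aligned}
\mathbf{e} &= \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y} = (\mathbf{I} - \mathbf{H})\boldsymbol{\epsilon}, \\
\operatorname{Var}(\mathbf{e}) &= \sigma^{2}(\mathbf{I} - \mathbf{H}), \\
\operatorname{Var}(e_{i}) &= \sigma^{2}(1 - h_{ii}), \qquad \operatorname{Cov}(e_{i}, e_{j}) = -\sigma^{2} h_{ij} \quad (i \neq j).
\end{aligned}
$$

So even when the errors are independent with constant variance $\sigma^{2}$, the residuals are in general correlated, and each one has a variance that shrinks as its leverage grows.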