# Checking the Assumptions

## Data Features that May Invalidate the Model

### Outliers
Data points with unusual outcome values, given their predictors, that can distort estimates, inflate variance, and affect inference.

### High Leverage Points 
Data points with unusual predictor values that can overly influence the fit.
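
As a rough illustration, leverage can be inspected with `hatvalues()` from base `R`. The sketch below fits `mpg` on `wt` using `mtcars`. The object names (`mod.lev`, `cutoff`, `high.lev`) are made up for this example, and the cut-off of twice the average leverage is a common rule of thumb assumed here for illustration, not something prescribed by these notes.

```r
data(mtcars)
mod.lev <- lm(mpg ~ wt, data = mtcars)

# Leverage (hat) values: the diagonal elements of the hat matrix
h <- hatvalues(mod.lev)

# The hat values sum to the number of estimated coefficients
p <- length(coef(mod.lev))
n <- nrow(mtcars)

# Rule-of-thumb cut-off (an assumption): twice the average leverage p/n
cutoff   <- 2 * p / n
high.lev <- h[h > cutoff]
print(sort(high.lev, decreasing = TRUE))
```

Cars flagged here have weights far from the average weight, which is what "unusual predictor values" means in a model with a single predictor.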

### Heteroscedasticity
Non-constant error variance, which may show up as a fan shape in a residual plot.
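
To make the fan shape concrete, the following sketch simulates data in which the error standard deviation grows with the predictor. Everything here is hypothetical: the seed, the sample size, the coefficients, and the object names are arbitrary choices for illustration only.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 200
x <- runif(n, min = 1, max = 10)

# Error standard deviation grows with x, violating constant variance
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 0.5 * x)
mod.het <- lm(y ~ x)

# Residuals against fitted values: the spread widens from left to right
plot(fitted(mod.het), resid(mod.het),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Crude numerical check: residual spread in the lower vs upper half
lower    <- fitted(mod.het) <= median(fitted(mod.het))
sd.lower <- sd(resid(mod.het)[lower])
sd.upper <- sd(resid(mod.het)[!lower])
print(c(lower = sd.lower, upper = sd.upper))
```

The split at the median is only a quick sanity check on the simulated data, not a formal test for heteroscedasticity.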

### Multicollinearity
*Perfect* multicollinearity would cause the model estimation to *fail*. So, in `R`, when we have two variables that are perfectly correlated, one will be effectively removed and all its associated estimates set to `NA`, as we can see below.

```r
data(mtcars)
wt.copy <- mtcars$wt
mod     <- lm(mpg ~ wt + wt.copy, data=mtcars)
summary(mod)
```

```
Call:
lm(formula = mpg ~ wt + wt.copy, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
wt.copy           NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
```


The summary output notes that one coefficient is `not defined because of singularities`. This is an obvious problem that has been dealt with gracefully. The more insidious issue is multicollinearity that is *high*, rather than *perfect*.
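
A minimal sketch of the *high* case is given below. It assumes a hypothetical variable `wt.noisy`, created by adding a small amount of random noise to `wt`. The seed and the noise standard deviation are arbitrary choices made for illustration.

```r
data(mtcars)
set.seed(1)  # arbitrary seed, for reproducibility only

# A noisy copy of wt: highly, but not perfectly, correlated with wt
wt.noisy <- mtcars$wt + rnorm(nrow(mtcars), mean = 0, sd = 0.05)
print(cor(mtcars$wt, wt.noisy))

mod.single <- lm(mpg ~ wt, data = mtcars)
mod.near   <- lm(mpg ~ wt + wt.noisy, data = mtcars)

# No coefficient is dropped this time, so nothing flags the problem
print(coef(summary(mod.near)))

# Compare the standard error of the wt coefficient across the two models
se.single <- coef(summary(mod.single))["wt", "Std. Error"]
se.near   <- coef(summary(mod.near))["wt", "Std. Error"]
print(c(single = se.single, near = se.near))
```

Here `R` estimates both coefficients without complaint, but the standard error for `wt` should be far larger than in the single-predictor model, which is why this situation is harder to spot.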

## Model Outputs for Checking Assumptions

### Residuals

### Standardised Residuals

### Predicted Values

### Leverage Values

### The Variance Inflation Factor (VIF)


## Standard Diagnostic Plots

### Residual vs Fitted Plot

### Q-Q Normal Plot

### Scale vs Location Plot

### Residuals vs Leverage Plot
Cook's distance...

`````{admonition} Residuals are Not Independent with Constant Variance
:class: tip
One of the main reasons for distinguishing between *errors* and *residuals* is that the estimation process gives the residuals *different* distributional properties from the errors. This means that *errors* and *residuals* are not expected to behave identically. So while it is correct to assume

$$
\epsilon_{i} \overset{\text{i.i.d.}}{\sim} \mathcal{N}\left(0,\sigma^{2}\right),
$$

it is *not* technically correct to assume the same for the *residuals*. This is because the estimation procedure can *induce* correlation between the residuals, and the residuals can have non-constant variance, depending upon a property known as *leverage*. We will discuss some of these concepts next week. For now, just note that the residuals can be used as an *approximation* for the errors, but we need to perform some additional checks to make sure that this approximation is reasonable.
`````
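
The point made in the box above can be checked numerically. The sketch below relies on the standard linear-model result that the residuals satisfy $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ and so have covariance $\sigma^{2}(\mathbf{I} - \mathbf{H})$, where $\mathbf{H}$ is the hat matrix. That result is stated here rather than derived, and the object names are made up for this example.

```r
data(mtcars)
mod.res <- lm(mpg ~ wt, data = mtcars)

# Hat matrix H = X (X'X)^{-1} X', built explicitly from the design matrix
X <- model.matrix(mod.res)
H <- X %*% solve(crossprod(X)) %*% t(X)
h <- unname(diag(H))  # leverage values

# Residuals are e = (I - H) y, so Var(e) = sigma^2 (I - H):
# unequal diagonal entries (variances) and non-zero off-diagonals (covariances)
I.minus.H <- diag(nrow(X)) - H
e <- as.vector(I.minus.H %*% mtcars$mpg)
print(all.equal(e, unname(resid(mod.res))))

# Scaling each residual by its own estimated standard deviation
# reproduces R's standardised residuals
sigma.hat <- summary(mod.res)$sigma
r.manual  <- e / (sigma.hat * sqrt(1 - h))
print(all.equal(r.manual, unname(rstandard(mod.res))))
```

Because each residual is divided by a term involving its own leverage, standardised residuals sit on a more comparable scale than raw residuals.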