# Checking the Assumptions
Now that we have established the nature of testing assumptions visually, and introduced the core assumptions made by a linear model, we can start investigating *how* to check those assumptions. Before getting there, we first need to discuss some data features that are not assumptions per se, but which can point to invalid assumptions or unduly influence the model fit. We will then look at information we can extract from the model to help us assess whether these data features are present and whether the model assumptions hold. Finally, we will examine a standard selection of plots that can be used for diagnostic purposes.

## Data Features that May Invalidate the Model
To begin with, we will discuss certain features of data that we need to be aware of. In some cases, these can point to potential assumption violations, but in others they are simply aspects of the data that distort or overly influence the results of our models.

### Outliers
Outliers are datapoints that sit further away from some reference point than we would expect. These are sometimes termed *extreme* datapoints. Such data are problematic because they can distort estimates, inflate variance and affect inference. For instance, data that sits far away from other datapoints will have the effect of *pulling* a mean towards it, so that the mean no longer accurately reflects the data as a whole.
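As a minimal sketch of this pulling effect, consider a small set of made-up scores (the values are invented purely for illustration and are not taken from any dataset in this lesson). The comparison with the median is an addition here, included only to show that not every summary is dragged to the same degree.

```r
# Hypothetical scores, invented purely for illustration
scores     <- c(4, 5, 5, 6, 7)
scores.out <- c(scores, 50)   # the same scores plus one extreme value

# The mean is dragged towards the extreme value ...
print(mean(scores))
print(mean(scores.out))

# ... whereas the median barely moves
print(median(scores))
print(median(scores.out))
```

With the extreme value included, the mean ends up larger than every one of the original scores, so it no longer describes the bulk of the data.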

In the context of our model, an outlier is a point that *does not fit*, in the sense that it is far away from our prediction. Thus, the point has a *large residual*. There could, of course, be many reasons for this. It could indicate a problem with our model, implying that the point *would fit* if we chose a different mean function. It could also indicate a problem with our data, implying there was some mistake when the data were collected. However, it is important to remember that an outlier does not make the data *wrong* and that an outlier is only defined as such *relative* to our model prediction. The same data may not be an outlier in another model. Furthermore, changing the model fit via removal of outlying points may end up making other points outliers. Indeed, *removal* of outliers is the most extreme of solutions and should be considered a *last resort*[^NASA-foot].
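The idea that an outlier is defined *relative* to the model can be sketched with simulated data. Everything below is an assumption made for illustration: the quadratic data-generating process, the seed and the object names are not part of the lesson's own examples.

```r
set.seed(1)   # arbitrary seed, chosen only for reproducibility

# Simulated data with a curved (quadratic) relationship
x <- 1:20
y <- 2 + 0.5 * x^2 + rnorm(length(x), mean=0, sd=1)

# Two candidate mean functions for the same data
mod.line  <- lm(y ~ x)            # straight line
mod.curve <- lm(y ~ x + I(x^2))   # quadratic

# Residual for the final observation under each model
res.line  <- residuals(mod.line)[20]
res.curve <- residuals(mod.curve)[20]
print(c(line=unname(res.line), curve=unname(res.curve)))
```

The final observation sits far from the straight-line prediction, yet close to the quadratic one. Nothing about the datapoint has changed, only the mean function it is being compared against.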

We will discuss the definition and identification of outliers further below. However, it is worth considering what should be done if a suspected outlier is detected:

1. Check the data itself to make sure there have been no data-entry mistakes. For instance, a decimal point placed in the wrong position will give rise to an erroneously small or large value.
2. If the data entry seems correct, consider the physical context. Although the value may be extreme relative to the rest of the data, is it unreasonably so? For instance, a very large score on a depression questionnaire must be assumed genuine, unless you have evidence that this particular subject was completing the questionnaire incorrectly. We will see ways later to assess whether there are unusual *combinations* of predictor values, which *could* point to such a situation.
3. If you have good evidence to indicate that the data is wrong, then removal would be appropriate. However, this is rarely the case and outliers should not just be removed for the sake of it.
4. If the datapoints are extreme, but removal is not justified, a method such as *robust regression* can be used, as discussed at the end of this lesson. What should be avoided is the classic Psychology trick of making an outlier less extreme by changing its value to lie just above the second most extreme point in the data. This is data tampering and should be avoided at all costs.
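As a rough sketch of the robust regression option mentioned in the final step, the example below uses `rlm()` from the `MASS` package, which by default fits a Huber M-estimator. This choice is an assumption made for illustration; the method covered at the end of the lesson may differ.

```r
library(MASS)   # provides rlm(); MASS ships with standard R installations

data(mtcars)

# Ordinary least squares and a robust alternative on the same formula
mod.ols <- lm(mpg ~ wt, data=mtcars)
mod.rob <- rlm(mpg ~ wt, data=mtcars)   # Huber M-estimation by default

# Compare the estimates side by side
print(cbind(OLS=coef(mod.ols), Robust=coef(mod.rob)))

# Final weights from the robust fit: values below 1 mark down-weighted cars
rob.weights <- setNames(mod.rob$w, rownames(mtcars))
print(round(sort(rob.weights)[1:5], 3))
```

The appeal of this approach is that every datapoint stays in the analysis with its recorded value. Points that fit poorly simply receive less weight, rather than being removed or altered.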

The main point of all this is that outliers are *relative* to the model, are not automatically *wrong* and must be considered *in context*. Removal should be a last resort and data tampering should be avoided. If we have a point that does not fit the model, we should consider why the model may be wrong first, rather than the data. For many research areas, the most useful points are those that lie in the extremes. If we are interested in depression then surely the *most* depressed subjects are of interest, even if they are rarer than the mild or moderate cases. In many instances, we simply need to accept that outliers are there and *could* be influencing the model, but that we may have no real justification for amputation. Hacking at our data to make it fit our model is *not* the right way round.



### High Leverage Points 
High leverage points are data points with unusual predictor values that can overly influence the fit. ... Note that a point with high leverage does not have to be an outlier. Similarly, an outlier does not automatically become a point with high leverage. Of course, the most concerning datapoints are those that are *both* outliers and have high leverage. This combination can be quantified using a measure known as *Cook's Distance*, which we will discuss further below.
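To make the distinction concrete, the sketch below pulls out leverage, standardised residuals and Cook's distance using base `R` functions. The model `mpg ~ wt + hp` is an arbitrary choice for illustration. The manual formula is the standard textbook expression for Cook's distance and is included only to show how the two ingredients combine.

```r
data(mtcars)
mod <- lm(mpg ~ wt + hp, data=mtcars)

lev   <- hatvalues(mod)        # leverage: depends on predictor values only
r.std <- rstandard(mod)        # standardised residuals: depend on the outcome too
cook  <- cooks.distance(mod)   # combines the two

# Cook's distance rebuilt from its two ingredients
p           <- length(coef(mod))   # number of estimated coefficients
cook.manual <- (r.std^2 / p) * lev / (1 - lev)
print(all.equal(cook, cook.manual))

# The five cars with the largest Cook's distance
top <- order(cook, decreasing=TRUE)[1:5]
print(round(cbind(leverage=lev, std.resid=r.std, cook=cook)[top, ], 3))
```

Because Cook's distance multiplies a residual term by a leverage term, a point needs to be reasonably large on *both* to stand out.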

### Multicollinearity
*Perfect* multicollinearity would cause the model estimation to *fail*. So, in `R`, when we have two variables that are perfectly correlated, one will have all its associated estimates set to `NA`, as we can see below.

In [8]:
data(mtcars)
wt.copy <- mtcars$wt                             # exact copy of wt, so perfectly correlated with it
mod     <- lm(mpg ~ wt + wt.copy, data=mtcars)   # wt.copy is picked up from the workspace
summary(mod)


Call:
lm(formula = mpg ~ wt + wt.copy, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
wt.copy           NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10


We can see that the output contains a note saying that one coefficient is `not defined because of singularities`. This is an obvious problem that has been dealt with gracefully. The more insidious issue is when we have multicollinearity that is *high*, rather than *perfect*. We can generate a predictor with a high correlation (but not perfect correlation) by simply adding some random noise to the copy of the `wt` variable, like so:

In [8]:
set.seed(666)                                         # fix the random noise for reproducibility
data(mtcars)
wt      <- mtcars$wt
wt.copy <- wt + rnorm(n=length(wt), mean=0, sd=0.2)   # copy of wt plus a little Gaussian noise
print(cor(wt,wt.copy))

[1] 0.9734991


So, we can see that `wt` and `wt.copy` have a correlation of $r = 0.97$, which would be considered very high. Now let us see what happens when these are both in the same model:

In [14]:
mod.multicol <- lm(mpg ~ wt + wt.copy, data=mtcars)
summary(mod.multicol)


Call:
lm(formula = mpg ~ wt + wt.copy, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2881 -2.4767 -0.0536  1.6162  6.6278 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.561      1.938  19.377   <2e-16 ***
wt            -6.966      2.467  -2.824   0.0085 ** 
wt.copy        1.559      2.309   0.675   0.5049    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.074 on 29 degrees of freedom
Multiple R-squared:  0.7567,	Adjusted R-squared:  0.7399 
F-statistic: 45.09 on 2 and 29 DF,  p-value: 1.259e-09


Notice that the standard error of `wt` has *increased* dramatically from 0.559 to 2.467. This is effectively a *four-fold* increase in uncertainty, with the $t$-statistic close to *one third* of its original value, going from -9.559 to -2.824. This is often referred to as the standard errors *blowing-up*, due to the increased uncertainty that the correlation introduces. We will discuss ways of diagnosing this below and then will discuss some "solutions" later in the lesson.
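One way to see where the four-fold figure comes from is the standard least-squares expression for the standard error of a slope. This is a sketch based on the usual textbook result rather than a derivation given in this lesson:

$$
\text{SE}\left(\hat{\beta}_{j}\right) = \frac{\hat{\sigma}}{\sqrt{(n-1)\,s^{2}_{x_{j}}}} \times \sqrt{\frac{1}{1 - R^{2}_{j}}},
$$

where $s^{2}_{x_{j}}$ is the sample variance of predictor $j$ and $R^{2}_{j}$ is the $R^{2}$ from regressing predictor $j$ on all the other predictors. With `wt` as the only predictor, $R^{2}_{j} = 0$ and the second factor equals 1. With `wt.copy` added, $R^{2}_{j}$ is simply the squared correlation between the two predictors, so

$$
\sqrt{\frac{1}{1 - 0.9735^{2}}} \approx \sqrt{\frac{1}{0.0523}} \approx 4.37.
$$

The small remaining gap to the observed ratio of $2.467 / 0.559 \approx 4.41$ comes from the residual standard error, which also changes slightly between the two models (from 3.046 to 3.074). The quantity under the square root is the Variance Inflation Factor covered later in this lesson.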

### Missing Data
We also need to consider the *reason* for the data being missing. For instance, if we are studying a condition like anxiety, it could be that the *most* anxious subjects are the ones who will not return for the follow-up visit. In this case, there is a systematic reason for the data being missing. Not only does this make imputation of missing data inappropriate, it also means that any analysis of complete data will be biased towards those with less severe anxiety. Again, the context is the most important element here for interpretation.
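The anxiety example can be sketched with a small simulation. The scores, the dropout mechanism and all object names are hypothetical. The aim is only to show the direction of the bias when the most anxious subjects are the least likely to return.

```r
set.seed(42)   # arbitrary seed, for reproducibility only

n <- 1000

# Hypothetical baseline anxiety scores (simulated, not real data)
anxiety <- rnorm(n, mean=50, sd=10)

# Probability of missing the follow-up rises with anxiety (an assumed mechanism)
p.dropout <- plogis((anxiety - 50) / 5)
returned  <- runif(n) > p.dropout

# Compare everyone who was recruited with those who came back
mean.all       <- mean(anxiety)
mean.completer <- mean(anxiety[returned])
print(c(all=mean.all, completers=mean.completer))
```

The subjects who return have a lower average anxiety than the full sample, so any analysis restricted to them describes a milder population than the one originally recruited.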

## Model Outputs for Checking Assumptions

### Residuals

### Standardised Residuals

### Leverage Values

### Studentised Residuals

### Predicted Values
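As a sketch of how the quantities named in these headings can be pulled from a fitted model, the example below uses base `R` extractor functions. The mapping of each function onto the lesson's terminology is an assumption: `rstandard()` is taken to give the standardised residuals and `rstudent()` the studentised residuals. The model `mpg ~ wt` is chosen only for illustration.

```r
data(mtcars)
mod <- lm(mpg ~ wt, data=mtcars)

e     <- residuals(mod)   # raw residuals
h     <- hatvalues(mod)   # leverage values
y.hat <- fitted(mod)      # predicted values
r.std <- rstandard(mod)   # standardised (internally studentised) residuals
r.stu <- rstudent(mod)    # studentised (externally studentised) residuals

# Standardised residuals: raw residuals scaled by their estimated standard deviation
s            <- sigma(mod)   # residual standard error
r.std.manual <- e / (s * sqrt(1 - h))
print(all.equal(r.std, r.std.manual))

print(round(head(cbind(fitted=y.hat, resid=e, leverage=h,
                       standardised=r.std, studentised=r.stu)), 3))
```

The manual calculation shows why leverage enters the scaling: a residual at a high-leverage point has a smaller expected spread, so it is divided by a smaller number.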

### The Variance Inflation Factor (VIF)


In [13]:
library(car)
print(vif(mod.multicol))

       wt   wt.copy        hp 
21.917371 19.766097  1.826262 


In general, we can take:

- VIF = 1 &mdash; No multicollinearity
- 1 < VIF < 5 &mdash; Moderate multicollinearity. Some caution may be needed, but not necessarily a problem for the model. We might expect some increase in the standard error compared to no multicollinearity, especially at the higher end, but not to the extent where we could consider it to have "blown-up".
- VIF > 5 &mdash; Very high multicollinearity. Standard errors will be much larger than usual and inference will become very unstable. Almost certainly an issue and some mitigation will be needed[^VIF-foot].
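To show what the VIF measures, the sketch below computes it by hand for a model containing only `wt` and the noisy `wt.copy`, using the usual definition $1/(1 - R^{2}_{j})$, where $R^{2}_{j}$ comes from regressing one predictor on the others. This uses base `R` only and is meant as an illustration of the definition, not a replacement for `vif()` from `car`.

```r
set.seed(666)
data(mtcars)

# Recreate the noisy copy of wt used earlier in the lesson
wt      <- mtcars$wt
wt.copy <- wt + rnorm(n=length(wt), mean=0, sd=0.2)

# Auxiliary regression: how well do the other predictors explain wt?
aux    <- lm(wt ~ wt.copy)
r2.aux <- summary(aux)$r.squared

# VIF for wt, and the implied inflation of its standard error
vif.wt <- 1 / (1 - r2.aux)
print(c(VIF=vif.wt, SE.inflation=sqrt(vif.wt)))
```

The square root of the VIF is the factor by which the standard error is multiplied relative to a model in which the predictor is uncorrelated with the others, which is why values well above 5 correspond to the "blown-up" standard errors described earlier.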

## Standard Diagnostic Plots

### Residual vs Fitted Plot

### Q-Q Normal Plot

### Scale vs Location Plot

### Residuals vs Leverage Plot
Cook's distance...
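The four plots named in this section can be produced in base `R` by calling `plot()` on a fitted `lm` object, with the `which` argument selecting the panels. The sketch below assumes an arbitrary model, `mpg ~ wt + hp`, purely for illustration.

```r
data(mtcars)
mod <- lm(mpg ~ wt + hp, data=mtcars)

# Panel numbers understood by plot() for lm objects:
#   1 = Residuals vs Fitted, 2 = Normal Q-Q,
#   3 = Scale-Location,      5 = Residuals vs Leverage
panels <- c(1, 2, 3, 5)

old.par <- par(mfrow=c(2, 2))   # arrange the plots in a 2 x 2 grid
plot(mod, which=panels)
par(old.par)                    # restore the previous layout
```

The Residuals vs Leverage panel also overlays contours of Cook's distance, which makes it a convenient single view of the outlier and leverage ideas discussed earlier.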

`````{admonition} Residuals are Not Independent with Constant Variance
:class: tip
One of the main reasons for distinguishing between *errors* and *residuals* is that the estimation process *changes* the distributional properties of the errors. This means that *errors* and *residuals* are not expected to behave identically. So while it is correct to assume

$$
\epsilon_{i} \overset{\text{i.i.d.}}{\sim} \mathcal{N}\left(0,\sigma^{2}\right),
$$

it is *not* technically correct to assume the same for the *residuals*. This is because the estimation procedure can *induce* correlation between the residuals, and the residuals can have non-constant variance, depending upon a property known as *leverage*. We will discuss some of these concepts next week. For now, just note that the residuals can be used as an *approximation* for the errors, but we need to perform some additional checks to make sure that this approximation is reasonable.
`````
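The point made in the box above can be checked numerically. Under the model, the residuals have covariance matrix $\sigma^{2}(\mathbf{I} - \mathbf{H})$, where $\mathbf{H}$ is the hat matrix whose diagonal contains the leverage values. This is a standard result, stated here as background rather than derived in this lesson. The sketch below builds $\mathbf{H}$ for an arbitrary illustrative model and inspects $\mathbf{I} - \mathbf{H}$.

```r
data(mtcars)
mod <- lm(mpg ~ wt + hp, data=mtcars)

# Hat matrix H = X (X'X)^{-1} X', built from the design matrix
X <- model.matrix(mod)
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Its diagonal holds the leverage values
print(all.equal(unname(diag(H)), unname(hatvalues(mod))))

# Under the model, the residual covariance is sigma^2 (I - H)
M <- diag(nrow(X)) - H

# Diagonal: relative variance of each residual (not constant)
print(round(range(diag(M)), 3))

# Off-diagonal: relative covariance between residuals (not zero)
print(round(range(M[upper.tri(M)]), 3))
```

The diagonal entries differ from one observation to the next and the off-diagonal entries are not zero, so the residuals are neither equal in variance nor independent, even when the errors are.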

[^VIF-foot]: Note that some authors suggest VIF=10 to be the marker for concerning multicollinearity. Here, we would recommend the more cautious approach of using VIF=5.

[^NASA-foot]: [Faraway (2005)](https://www.utstat.toronto.edu/~brunner/books/LinearModelsWithR.pdf) provides a real-world example of why this is *not* good practice via the story of how the discovery of the hole in the ozone layer was delayed by several years because NASA's automatic data analysis algorithms were discarding very low readings as they were assumed to be mistakes.