# Data Features
Now that we have established how assumptions can be tested visually, and have introduced the core assumptions made by a linear model, we can start investigating *how* to check those assumptions. Before getting there, we will discuss certain features of data that we need to check for in every analysis. In some cases these can point to assumption violations, but in others they can be aspects that distort or overly influence the results of our models.

## Outliers
In the context of our model, an outlier is a datapoint that *does not fit*. These are points that sit much further away from the model prediction than we expect, given the rest of the data. In other words, these are points that have *large residuals*. There could be many reasons for this. It may indicate a problem with the model, implying that these points *would fit* if we chose a more complex mean function. It may also indicate a problem with the data, which implies there could have been some mistake when the data was collected. Or, it could simply be that we have observed a rare event. Whatever the reason, it is important to remember that "outlier" does not necessarily mean that the data are *wrong* and that an outlier is only defined as such *relative* to our model prediction. The same data points may not be outliers in a different model. Furthermore, changing the model fit via removal of these points may make the residuals of other points larger, introducing new outliers to the analysis. Indeed, *removal* of outliers is the most extreme solution and should always be considered a *last resort*[^NASA-foot]. 
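
To make the idea that an outlier is *relative* to the model concrete, the following is a minimal sketch rather than a formal diagnostic. It assumes two illustrative models of `mpg` from `mtcars` (a straight line in `wt` and a quadratic in `wt`, both chosen only for this example) and compares the standardised residuals of the same cars under each one.

```r
# Sketch: the same observations can have different residuals under different models
data(mtcars)

# Two illustrative mean functions (assumptions for this example only)
mod.linear <- lm(mpg ~ wt, data=mtcars)            # straight line
mod.quad   <- lm(mpg ~ wt + I(wt^2), data=mtcars)  # adds curvature

# Standardised residuals for every car under each model
res <- data.frame(linear    = rstandard(mod.linear),
                  quadratic = rstandard(mod.quad))

# Cars with the three largest absolute residuals under the linear model
worst <- order(abs(res$linear), decreasing=TRUE)[1:3]
print(round(res[worst, ], 2))
```

If a car has a large residual in one column but not the other, it is only "unusual" with respect to one of the two mean functions, which is the sense in which outliers depend on the model.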

Outliers are problematic because they can bias parameter estimates, as well as increase the value of the estimated variance. Outliers therefore run the risk of biasing inference, leading to erroneous conclusions. We will discuss the definition and identification of outliers later in the lesson. For the moment, it is worth considering what should be done if a suspected outlier is detected:

1. Check the data itself to make sure there have been no data-entry mistakes.
2. If the data entry seems correct, consider the physical context. Although the value may be extreme relative to the rest of the data, is it unreasonably so?
3. If you have good evidence to indicate that the data is wrong, then removal would be appropriate. However, this is rarely the case and outliers should not just be removed for the sake of it.
4. If the data points are extreme, but removal is not justified, a method such as *robust regression* can be used, as discussed at the end of this lesson. What should be avoided is the classic Psychology trick of making an outlier less extreme by changing its value to lie just above the second most extreme point in the data. This is data tampering.



```{admonition} Perspective on Outliers
:class: tip
When dealing with outliers, always try to retain a suitable perspective. Remember, outliers are *relative* to the model, are not automatically *wrong* and must be considered *in context*. Removal should be a last resort and data tampering should be avoided. If we have a point that does not fit the model, we should consider why the model may be wrong first. We should also remember that rare events *do happen*. The probability in the tails of the distribution is not 0. For many research areas, the most useful points are those that lie in the extremes. If we are interested in depression then surely the *most* depressed subjects are of interest, even if they are rarer? In many instances, we simply need to accept that outliers are there and *could* be influencing the model, but that we may have no real justification for removal.
```

## High Leverage
As discussed above, an outlier is an observation whose value of $y$ is extreme relative to the model prediction. However, we can also have extreme values of the predictor variables $\mathbf{x} = \{x_{1},x_{2},\dots,x_{k}\}$. Because there are usually $k > 1$ predictors, the definition of an outlier in predictor space is more complex. The easiest way to think about this is that an outlier in predictor space is an unusual *combination* of predictor values. As an example, we can use a slightly altered version of the `mtcars` dataset. Row 31 of the original dataset has a value of `hp` equal to 335. We will replace this with a value of 60. On its own, this value is not surprising as there are other cars in the dataset with a similar horsepower. However, what *is* unusual is a value of 60 for a car with 8 cylinders and a weight above 3 (i.e. over 3,000 lbs). So this value is unusual when *combined* with the other predictors.

This is illustrated in the plot of predictor space below. Each axis represents one of the variables and the observation we have changed is highlighted in red. If we rotate the plot around, we can see that this point looks unusual in terms of the full pattern of data across all the variables. However, the values of this point are not particularly unusual within any single dimension.

In [None]:
# Load required package
library(rgl)

# Use mtcars data
data(mtcars)

# Extract variables
x <- mtcars$wt    # x-axis
y <- mtcars$hp    # y-axis
z <- mtcars$cyl   # z-axis

# Row to highlight
highlight_row <- 31
y[highlight_row] <- 60 # replace the original hp of 335 with the altered value

# Create 3D scatter plot
plot3d(x, y, z, xlab = "Weight (wt)", ylab = "Horsepower (hp)", zlab = "Cylinders (cyl)",
  col = "skyblue", size = 2, type = "s", box = TRUE)

# Highlight the selected point (row 31)
spheres3d(x[highlight_row], y[highlight_row], z[highlight_row], radius=5, color = "red")

In [None]:
rglwidget(width=772)

<br>

In the context of linear models, the most commonly used metric to assess outliers in predictor space is called *leverage*. We will discuss the calculation of leverage below, but for the moment the term can be understood in reference to points that have the *potential* to be highly influential in the model fit. Thinking of a regression line, a point with high leverage is able to pull the regression line away from the rest of the data. Importantly, leverage captures the idea that if many points have very similar combinations of predictor values, then no single point overly influences the model fit. In this sense, the fit is a *balance* between all those points. However, if there is a single point that has an unusual combination of predictor values, then there are no other similar points to balance the fit and much more of the influence will be placed on that one point. In an ideal situation, all our measured points would have similar amounts of leverage and the model fit would be a balance between all of them. If some points have *high* leverage, it indicates that they have combinations of predictor values that are not similar to the rest of the data. These individual points therefore have much more influence over the model fit, resulting in a situation where the fit is *unbalanced*. This is not necessarily a bad thing, as certain unique combinations of predictor values may be of interest due to their rarity. However, it is usually a flag to *check the data*, particularly when removing a high-leverage point changes the model fit more than removing other data points would.
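
As a rough preview of how leverage can be inspected in `R`, the sketch below uses the built-in `hatvalues()` function on the altered `mtcars` data. The model `mpg ~ wt + hp + cyl` and the object names are assumptions made purely for illustration.

```r
# Sketch: leverage (hat values) for the altered mtcars data
data(mtcars)
cars.alt <- mtcars
cars.alt$hp[31] <- 60   # the alteration described in the text

# Illustrative model (an assumption for this example)
mod.lev <- lm(mpg ~ wt + hp + cyl, data=cars.alt)

# Hat values measure how unusual each row's combination of predictors is
lev <- hatvalues(mod.lev)

# Leverage values sum to the number of estimated coefficients (p),
# so the average leverage is p / n
p <- length(coef(mod.lev))
n <- nrow(cars.alt)
print(c(average = p / n, row31 = unname(lev[31])))

# Rows with the largest leverage
print(head(sort(lev, decreasing=TRUE), 3))
```

Comparing the leverage of row 31 with the average gives a sense of how much more weight this one observation carries than a "typical" point. Note that the outcome `mpg` plays no part in the calculation, as leverage depends only on the predictors.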

```{admonition} Outliers and Leverage
:class: tip
An important element to understand is that an observation that exerts high leverage on the fit will not necessarily be an outlier. Similarly, an outlier is not necessarily associated with high leverage. Remember, outliers and leverage concern the *outcome* and the *predictors* separately. We can have an observation that sits far away from the model prediction, but has a perfectly usual collection of predictor values. Similarly, we can have an observation that lies very close to the model prediction, but has a very unusual combination of predictor values. Of course, the most concern comes from an observation that is *both* an outlier and exerts high leverage. Again, this data may not be *wrong*, but it does require some consideration given that it is unusual in both outcome and predictor space. We will see a little later how this combination of outliers and leverage can be quantified using a measure known as *Cook's Distance*.
```

## Multicollinearity
Beyond outliers and observations with high leverage, we also need to take the *correlation* between our predictor variables into account. High correlation results in an issue known as *multicollinearity*, where our estimation and inference can become wildly unstable. To understand this, remember that we can conceptualise a linear model as dividing explanatory variance between the predictor variables. When predictors are *uncorrelated* (i.e. correlation is 0), this is very easy. However, when there is a lot of overlap, ambiguity exists about which variable should be assigned the shared chunks of this variation. We can think back to the [Venn diagram intuition](https://pchn63101-advanced-data-skills.github.io/Simple-Multiple-Regression/5.multiple-regression.html#venn-diagram-intuition) behind multiple regression. When multicollinearity is high, the shared region between two predictor variables is very large and their unique contributions shrink. Even if the total amount of variation explained across the two predictors is large, there is greater uncertainty about how this should be split between them. This manifests as much greater uncertainty about the parameter estimates. Because of the shared variation, there is a much wider range of possible parameter values for each variable, depending upon how the variation is split. As such, there will be a larger standard error. This will make the test statistics *smaller* and our statistical power will be harmed. In the worst scenarios, the standard errors are said to "blow up", and our inference becomes wildly unstable because we just cannot say with any certainty what value each coefficient is likely to have. Clearly, this is a situation we wish to avoid.

We can also make an important distinction between *perfect* multicollinearity and *high* multicollinearity. When multicollinearity is *perfect*, one predictor is an exact linear function of another, so the correlation between them is $r = \pm 1$. This would actually cause the model estimation to *fail*. For example, in `R`, when we have two variables that are perfectly correlated, one will have all its associated estimates set to `NA`.

In [None]:
data(mtcars)
wt.copy <- mtcars$wt # identical copy of wt
mod     <- lm(mpg ~ wt + wt.copy, data=mtcars)
summary(mod)


Call:
lm(formula = mpg ~ wt + wt.copy, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
wt.copy           NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10


We can see that the output includes a note saying that one coefficient is `not defined because of singularities`. This is an obvious problem that has been dealt with gracefully. The more insidious issue is when we have multicollinearity that is *high* (e.g. $r > 0.8$), rather than *perfect*. This will not cause the estimation to fail, but it will cause problems. To see this, we can generate a predictor with a high correlation by adding a small amount of random noise to the copy of `wt`.

In [None]:
set.seed(666)
data(mtcars)
wt      <- mtcars$wt
wt.copy <- wt + rnorm(n=length(wt), mean=0, sd=0.2) # add small amount of noise
print(cor(wt,wt.copy))

[1] 0.9734991


So, we can see that `wt` and `wt.copy` have a correlation of $r = 0.97$, which would be considered very high. Now let us see what happens when both of these variables are in the same model.

In [None]:
mod.multicol <- lm(mpg ~ wt + wt.copy, data=mtcars)
summary(mod.multicol)


Call:
lm(formula = mpg ~ wt + wt.copy, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2881 -2.4767 -0.0536  1.6162  6.6278 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.561      1.938  19.377   <2e-16 ***
wt            -6.966      2.467  -2.824   0.0085 ** 
wt.copy        1.559      2.309   0.675   0.5049    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.074 on 29 degrees of freedom
Multiple R-squared:  0.7567,	Adjusted R-squared:  0.7399 
F-statistic: 45.09 on 2 and 29 DF,  p-value: 1.259e-09


Notice that the standard error of `wt` has *increased* dramatically from 0.559 to 2.467. This is more than a *four-fold* increase in uncertainty, with the $t$-statistic falling to less than *one third* of its original magnitude, going from -9.559 to -2.824. In general, perfect multicollinearity is usually a mistake that can be easily rectified. High multicollinearity, on the other hand, comes from variables that are very closely related, to the extent that one could be used as a proxy for another. This often happens in real-world datasets. For instance, in a sample of depressed individuals who also have high levels of anxiety, depression severity and anxiety severity will be highly correlated and both will have similar degrees of explanatory power. We will discuss ways of diagnosing the degree of multicollinearity, as well as some potential "solutions", later in the lesson.
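
The size of this increase can be tied directly to the correlation between the two predictors. The following sketch refits both models and checks that, with only two predictors, the standard error of `wt` grows by a factor of $\sqrt{1/(1 - r^{2})}$, scaled by the change in the residual standard error. The quantity $1/(1 - r^{2})$ is commonly called the *variance inflation factor*. The object names `mod.single`, `inflation` and `sigma.ratio` are introduced here only for illustration.

```r
# Sketch: linking the inflated standard error to the predictor correlation
set.seed(666)
data(mtcars)
wt      <- mtcars$wt
wt.copy <- wt + rnorm(n=length(wt), mean=0, sd=0.2)  # same noisy copy as above

mod.single   <- lm(mpg ~ wt, data=mtcars)
mod.multicol <- lm(mpg ~ wt + wt.copy, data=mtcars)

# Standard errors of the wt coefficient in each model
se.single   <- summary(mod.single)$coefficients["wt", "Std. Error"]
se.multicol <- summary(mod.multicol)$coefficients["wt", "Std. Error"]

# With two predictors, the inflation depends only on their correlation
r         <- cor(wt, wt.copy)
inflation <- 1 / (1 - r^2)

# The ratio of standard errors equals sqrt(inflation) multiplied by the
# ratio of the residual standard errors from the two models
sigma.ratio <- sigma(mod.multicol) / sigma(mod.single)
print(c(observed = se.multicol / se.single,
        implied  = sqrt(inflation) * sigma.ratio))
```

The two printed values should agree, which shows that the loss of precision is driven almost entirely by how strongly the predictors are correlated, rather than by any change in how well the model fits.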

## Missing Data
A final data feature that we always need to be aware of is *missing data*. Linear models do not have any facility to work with missing data, as the regression equations require complete data for every observation. As such, any observations with missing values will be removed by default by `lm()` in `R`. Other functions require explicit indication of what to do when data are missing. For instance, the `mean()` function returns `NA` unless we set `na.rm=TRUE`:

In [None]:
y.missing <- c(1,5,7,4,3,NA,2)
print(mean(y.missing))
print(mean(y.missing, na.rm=TRUE))


[1] NA
[1] 3.666667
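
The default removal performed by `lm()` is easy to miss because it happens silently. As a small sketch, assuming we artificially blank out three `wt` values in a copy of `mtcars` (the rows chosen are arbitrary and purely hypothetical), we can compare the number of rows in the data with the number actually used in the fit.

```r
# Sketch: lm() drops incomplete rows under the default na.action
data(mtcars)
cars.miss <- mtcars
cars.miss$wt[c(3, 10, 25)] <- NA   # hypothetical missing values for illustration

mod.miss <- lm(mpg ~ wt, data=cars.miss)

print(nrow(cars.miss))  # rows in the data (32)
print(nobs(mod.miss))   # rows actually used in the fit (29)
```

The `summary()` output of such a model also reports how many observations were deleted due to missingness, so it is worth checking that line whenever the data may be incomplete.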


Our options in the face of missing data are either *removal* or *imputation*. Removal is generally the default approach but, depending on the data in question, we can lose a lot of power. Imputation involves replacing the missing values. However, this is always effectively a *guess*, no matter how principled the approach. The simplest method is to replace the missing values with the *mean* or *median* of the variable in question. Whilst easy to do, this method can result in biased model estimates as the imputed data adds no new information and makes the average appear more certain than it is. More complex methods are based on imputing model predictions. For instance, we could fit a regression to the complete cases and use its predictions to fill in the missing values. We gain back power in terms of degrees of freedom, but this method will bias the model fit because every imputed value lies exactly on the regression line, making the fit seem more certain than it is. Much more complex methods are available by using *multiple imputation* approaches. For instance, a sophisticated imputation algorithm is available in the `R` package `mice`. However, no matter how complex the method, we can never *fully* justify imputation as we can never definitively know what values would have been there in the first place.
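
To illustrate why mean imputation overstates certainty, the sketch below reuses the small `y.missing` vector from earlier. The helper function `se()` is defined here only for this example and is not part of base `R`.

```r
# Sketch: mean imputation leaves the mean unchanged but shrinks the spread
y.missing <- c(1,5,7,4,3,NA,2)

# Replace the missing value with the mean of the observed values
y.imputed <- y.missing
y.imputed[is.na(y.imputed)] <- mean(y.missing, na.rm=TRUE)

# The mean is identical before and after imputation
print(mean(y.missing, na.rm=TRUE))
print(mean(y.imputed))

# Hypothetical helper: standard error of the mean, ignoring missing values
se <- function(y) sd(y, na.rm=TRUE) / sqrt(sum(!is.na(y)))

# The imputed data appear less variable and the mean appears more precise
print(c(observed = sd(y.missing, na.rm=TRUE), imputed = sd(y.imputed)))
print(c(observed = se(y.missing), imputed = se(y.imputed)))
```

The imputed value sits exactly on the mean, so it contributes nothing to the sum of squared deviations while still increasing the sample size. Both the standard deviation and the standard error therefore shrink, even though no new information has been added.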

Perhaps more important is the *reason* for the data being missing. For instance, if we are studying a condition like anxiety, it could be that the *most* anxious subjects are the ones who will not return for the follow-up visit. In this case, there is a systematic reason for the data being missing. Not only does this make imputation of missing data inappropriate, it also means that any analysis of complete data will be biased towards those with less severe anxiety. Again, the context is the most important element here for interpretation. This example illustrates the difference between data being *missing at random* (MAR) and *missing not at random* (MNAR). Imputation relies on assuming MAR, but if the data are MNAR then the missingness depends on values we never observed, so we cannot impute properly. We will not be discussing imputation any further at this point in the course and would recommend removal as the most appropriate method at present.

`````{topic} What do you now know?
In this section, we have explored some of the key data features you need to be aware of when considering the model fit. Although such features sometimes point to violations of the assumptions, other times they act to bias the fit in certain undesirable ways. After reading this section, you should have a good sense of:

- What an outlier is, but also how outliers are not necessarily "wrong" and cannot just be removed to improve the model fit.
- Leverage as the concept of an outlier in *predictor space*. In other words, a seemingly unusual combination of predictor values. 
- The idea that an outlier does not necessarily have high leverage, and a point with high leverage is not necessarily an outlier. However, the most *dangerous* points are those that are *both*.
- The concept of multicollinearity in relation to the correlation between predictors and how this can cause a dramatic increase in the estimated standard errors.
- The notion of missing data, particularly in relation to *why* data is missing.
- The idea that we can either *remove* or *impute* missing values, but both options have their limitations.
`````

[^NASA-foot]: [Faraway (2005)](https://www.utstat.toronto.edu/~brunner/books/LinearModelsWithR.pdf) provides a real-world example of why this is *not* good practice. This concerns the delay in the discovery of the hole in the ozone layer due to NASA's automatic data analysis algorithms discarding very low readings assumed to be mistakes.