# Influential points: Leverages and Outliers

## Learning Objectives
- Identify when a datapoint is influential
- Understand the difference between a leverage point and an outlier
- Understand the effect that these anomalous datapoints can have on a model
- Know how to deal with influential points

This is the final notebook we're going to study before applying all the EDA techniques we've learnt to a dataset. In this notebook, we'll look at identifying datapoints that don't seem to follow the general trend of the data. These datapoints can be classified into two categories: **leverages** and **outliers**. After seeing examples of both, we'll look at the effect they can have on your model, and how to deal with them correctly.

We'll be working with toy data (sourced from [here](https://online.stat.psu.edu/stat462/node/170/)), because the concepts apply to any continuous variable. The dataset is generated and doesn't come with labelled axes, so we'll give it a concrete interpretation: years of experience vs salary.

Two kinds of plot are great for identifying influential points: scatter plots and box plots. When we are fitting a model, scatter plots are preferred because they map an x value to a y value (so we see both dimensions). When looking at a variable in isolation, box plots are the tool to use (recall we did this in a previous notebook).
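As a reminder of what a box plot flags for a single variable, here is a minimal sketch of the usual 1.5 × IQR whisker rule. The `tukey_fences` helper and the salary values are made up for illustration and are not part of the notebook's datasets.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits of the k * IQR rule used by box plots."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up salaries (x1000), for illustration only
salaries = [21, 23, 24, 25, 26, 27, 28, 30, 31, 75]

lower, upper = tukey_fences(salaries)
flagged = [s for s in salaries if s < lower or s > upper]
print(f"fences: ({lower:.2f}, {upper:.2f}), flagged: {flagged}")
```

Anything outside the fences is drawn as an individual point on a box plot. Note that this only looks at one variable at a time, which is why scatter plots are more useful once a model is involved.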

So, what's the difference between leverages and outliers?
- Leverage: A datapoint has high leverage if it has an extreme x value, i.e. one that falls outside the range where most of the other x values lie.
- Outlier: An outlier is a datapoint whose response y doesn't follow the general trend of the rest of the data.

Note that a given datapoint can be *both* a leverage point and an outlier; the two are not mutually exclusive. Furthermore, even if a datapoint falls into one of these categories, that doesn't necessarily make it influential. A point is influential only if including or excluding it noticeably changes the fitted model.
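To make "extreme x value" concrete, here is a small sketch of the leverage formula for a regression with a single predictor, $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$. The `leverages` helper, the data and the cut-off of $3p/n$ are assumptions chosen for illustration; other cut-offs (such as $2p/n$) are also in use.

```python
def leverages(x):
    """Leverage h_i of each point in a simple (one-predictor) linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Made-up years of experience; the last value sits far from the rest
yoe = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
h = leverages(yoe)

# Rule of thumb (an assumption, not a hard rule): flag h_i > 3 * p / n,
# where p = 2 parameters (intercept and slope)
threshold = 3 * 2 / len(yoe)
high_leverage = [xi for xi, hi in zip(yoe, h) if hi > threshold]
print(high_leverage)
```

Leverage depends only on the x values, never on y. That is why a point can have high leverage and still sit perfectly on the trend.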

Let's load some datasets and see if you can tell whether the unusual points are leverages or outliers.

In [27]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

df_1 = pd.read_csv("https://aicore-files.s3.amazonaws.com/Data-Science/influential/1.txt", sep="\t", encoding="utf-16")
df_2 = pd.read_csv("https://aicore-files.s3.amazonaws.com/Data-Science/influential/2.txt", sep="\t", encoding="utf-16")
df_3 = pd.read_csv("https://aicore-files.s3.amazonaws.com/Data-Science/influential/3.txt", sep="\t", encoding="utf-16")
df_4 = pd.read_csv("https://aicore-files.s3.amazonaws.com/Data-Science/influential/4.txt", sep="\t", encoding="utf-16")
df_list = [df_1, df_2, df_3, df_4]

for df in df_list:
    df.drop("Row", axis=1, inplace=True)
    df.columns = ["YoE", "Salary"] # YoE = Years of Experience

df_1.head()

|   | YoE | Salary |
|---|-----|--------|
| 0 | 0.1 | -0.0716 |
| 1 | 0.45401 | 4.1673 |
| 2 | 1.09765 | 6.5703 |
| 3 | 1.27936 | 13.815 |
| 4 | 2.20611 | 11.4501 |


In [31]:
rows, cols = 2, 2
fig = make_subplots(rows=rows, cols=cols, subplot_titles=("Example 1", "Example 2", "Example 3", "Example 4"))

for i, df in enumerate(df_list):
    row = (i // cols) + 1
    col = (i % cols) + 1
    
    fig.add_trace(
        go.Scatter(x=df["YoE"], y=df["Salary"], mode="markers"),
        row=row, col=col
    )

fig.update_layout(title_text="Years of Experience vs Salary (x1000)", showlegend=False)
fig.show()

Example 1 doesn't contain any leverage points or outliers. Example 2 has an outlier at x=4, because its y value doesn't follow the general trend of the data. Example 3 contains a leverage point (at x=14), but it isn't an outlier because it follows the general trend of the data. Example 4 contains a point that is both an outlier and a leverage point: it has an extreme x value, and its y value isn't what we'd expect from the trend of the rest of the data.

Let's investigate each of these a bit more thoroughly. We'll omit Example 1 because there is nothing wrong with that data. We'll start with Example 4, as this is the most obvious case of a point which could be influential. Following this, we'll analyse Examples 2 and 3.

### Example 4

To demonstrate how much an outlier can skew a fit, let's start by fitting two simple linear regression models to this dataset: one with the datapoint present, and the other without. Note that in most situations you probably don't need to do this when actually analysing your data - we're running through it here so you can see the effect that a true outlier has on the model.

In [38]:
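# Drop only the row where YoE == 13 AND Salary == 15 (the aberrant point).
# Filtering with (YoE != 13) & (Salary != 15) would also drop any row matching just one condition.
df_4_no_outlier = df_4[~((df_4["YoE"] == 13) & (df_4["Salary"] == 15))]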
px.scatter(df_4_no_outlier, "YoE", "Salary")

In [76]:
import statsmodels.formula.api as smf

## Fit two models: with_outlier and without_outlier.
with_outlier_model = smf.ols("Salary ~ YoE", df_4).fit()
without_outlier_model = smf.ols("Salary ~ YoE", df_4_no_outlier).fit()

## Create a go figure object
fig = go.Figure()

## Plot the x, y datapoints from the dataframe with an outlier. Remember to appropriately name the trace
fig.add_trace(
    go.Scatter(x=df_4["YoE"], y=df_4["Salary"], mode="markers", name="Years of Experience vs Salary (x1000)")
)

## Add a trace, plotting the fitted values from the with_outlier model
fig.add_trace(
    go.Scatter(x=df_4["YoE"], y=with_outlier_model.fittedvalues, name="Regression with outlier")
)

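## Add a trace, plotting the without_outlier model's predictions. This model was fitted on one
## fewer row, so its fittedvalues would not line up with df_4["YoE"]; predict on df_4 instead
fig.add_trace(
    go.Scatter(x=df_4["YoE"], y=without_outlier_model.predict(df_4), name="Regression without outlier")
)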

## Add a title and show the graph
fig.update_layout(title="Example 4: Fitted regression line with and without outlier")
fig.show()

Woah! That one aberrant point has massively changed our regression line, so we can deem it influential. Let's look at what's going on analytically.

In [45]:
print(with_outlier_model.summary())
print("*"*100)
print(without_outlier_model.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.552
Model:                            OLS   Adj. R-squared:                  0.528
Method:                 Least Squares   F-statistic:                     23.41
Date:                Wed, 12 Aug 2020   Prob (F-statistic):           0.000114
Time:                        15:13:15   Log-Likelihood:                -78.017
No. Observations:                  21   AIC:                             160.0
Df Residuals:                      19   BIC:                             162.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.5046      4.222      2.014      0.0

There are a couple of things worth pointing out here: the $R^2$ value and the coefficients. With the outlier removed, $R^2$ rises from 0.552 to about 0.973, an increase of around 0.42. That's huge! Secondly, we can compare the two regression formulas with each other:

$$
\begin{aligned}
\hat{y}_O &= 3.32\,\text{YoE} + 8.50 \\
\hat{y}_N &= 5.12\,\text{YoE} + 1.73
\end{aligned}
$$

Here $\hat{y}_O$ is the prediction from the model fitted with the outlier and $\hat{y}_N$ is the prediction from the model fitted without it. There is a very large change in the YoE coefficient: an increase of over 50% from the O model to the N model. Left in, this one datapoint would heavily skew any prediction.
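To see what this means for actual predictions, the short sketch below plugs a few values of YoE into the two fitted lines, using the rounded coefficients quoted above. The function names and the chosen YoE values are purely illustrative.

```python
def predict_with_outlier(yoe):
    """O model: fitted with the outlier included (rounded coefficients)."""
    return 3.32 * yoe + 8.50

def predict_without_outlier(yoe):
    """N model: fitted with the outlier removed (rounded coefficients)."""
    return 5.12 * yoe + 1.73

# Compare the predicted salary (x1000) at a few illustrative experience levels
for yoe in (2, 5, 10):
    o = predict_with_outlier(yoe)
    n = predict_without_outlier(yoe)
    print(f"YoE={yoe:>2}: O={o:.2f}, N={n:.2f}, gap={n - o:.2f}")
```

The gap between the two lines grows as YoE increases, so the outlier's effect is largest for the more experienced end of the range.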

The standard error of the YoE coefficient in the O model is about three times larger than in the N model. A 95% confidence interval for the slope is therefore substantially wider for the O model than for the N model (and the \[0.025, 0.975\] columns numerically demonstrate this).

Finally, although both models show that YoE is significant, the t value for YoE has massively increased (from 4.83 to 25.6) when the *one* outlier we had was removed.
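The standard error, t value and confidence interval are tied together: $t = \text{coef} / \text{se}$, and the 95% interval is $\text{coef} \pm t_{crit} \cdot \text{se}$. The sketch below recovers approximate standard errors from the rounded slopes and t values quoted above. The critical values are taken from standard Student's t tables, assuming 19 residual degrees of freedom for the O model and 18 for the N model (one observation fewer).

```python
# Slope estimates and t statistics as quoted in the text (rounded)
coef_o, t_o = 3.32, 4.83
coef_n, t_n = 5.12, 25.6

# t = coef / se, so se = coef / t
se_o = coef_o / t_o
se_n = coef_n / t_n

# Two-sided 95% critical values of Student's t (from standard tables)
t_crit_o = 2.093  # 19 degrees of freedom
t_crit_n = 2.101  # 18 degrees of freedom (assumes exactly one row was removed)

half_width_o = t_crit_o * se_o
half_width_n = t_crit_n * se_n

print(f"O model: se={se_o:.3f}, 95% CI = {coef_o} +/- {half_width_o:.2f}")
print(f"N model: se={se_n:.3f}, 95% CI = {coef_n} +/- {half_width_n:.2f}")
print(f"se ratio: {se_o / se_n:.2f}")
```

Because the inputs are rounded, the results are approximate, but they show where the wider interval comes from: a larger standard error directly stretches the interval and shrinks the t value.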

Thus, we can conclude that this one datapoint is a leverage point, an outlier, and an influential point.

However, some datapoints, despite being outliers or leverages, may not actually influence the model significantly. Let's take a look at Examples 2 and 3 to determine what's going on in these situations.


In [46]:
# Drop only the single row matching both coordinates of the suspect point
df_2_n = df_2[~((df_2["YoE"] == 4) & (df_2["Salary"] == 40))]
df_3_n = df_3[~((df_3["YoE"] == 14) & (df_3["Salary"] == 68))]

In [77]:
# Example 2
with_outlier_model = smf.ols("Salary ~ YoE", df_2).fit()
without_outlier_model = smf.ols("Salary ~ YoE", df_2_n).fit()

fig = go.Figure()
fig.add_trace(
    go.Scatter(x=df_2["YoE"], y=df_2["Salary"], mode="markers", name="Years of Experience vs Salary (x1000)"))
fig.add_trace(
    go.Scatter(x=df_2["YoE"], y=with_outlier_model.fittedvalues, name="Regression with outlier"))
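# The without-outlier model was fitted on one fewer row, so predict on df_2 to keep x and y aligned
fig.add_trace(
    go.Scatter(x=df_2["YoE"], y=without_outlier_model.predict(df_2), name="Regression without outlier"))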

fig.update_layout(title="Example 2: Fitted regression line with and without outlier")
fig.show()

In [74]:
print("With outlier parameters: \n", with_outlier_model.params)
print("With outlier R2: \t", with_outlier_model.rsquared)

print("\n")
print("Without outlier parameters: \n", without_outlier_model.params)
print("Without outlier R2: \t", without_outlier_model.rsquared)


With outlier parameters: 
 Intercept    2.957638
YoE          5.037345
dtype: float64
With outlier R2: 	 0.9100509522985457


Without outlier parameters: 
 Intercept    1.732178
YoE          5.116869
dtype: float64
Without outlier R2: 	 0.9731681321750607


Although we see a slight deviation in the coefficients (specifically in YoE), the two values are very similar to each other. Similarly, even though we see some deviation in the $R^2$ metric, it is considerably smaller than what we witnessed previously. Despite the slight drop in the metric in the O model, the relationship between y and x is still strong.

So despite this point being an outlier, I would say that it isn't influential, and leaving it in the model would be fine.
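A judgement like this can be made less subjective with a numeric influence measure. One common choice is Cook's distance, which combines a point's residual with its leverage: $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_i}{(1 - h_i)^2}$. Below is a minimal sketch for a one-predictor regression. The `cooks_distances` helper and the data are made up for illustration (they are not the notebook's datasets), and the threshold of 1 is only a frequently quoted rule of thumb.

```python
def cooks_distances(x, y):
    """Cook's distance of every point in a simple linear regression y ~ x."""
    n, p = len(x), 2  # p counts the intercept and the slope
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]

# Made-up data that roughly follows y = 5x
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5.2, 9.8, 15.1, 20.3, 24.7, 30.2, 34.9, 40.1]

# An off-trend point in the middle of the x range (outlier, low leverage) ...
mid = cooks_distances(x + [4.5], y + [40.0])
# ... versus an off-trend point at an extreme x (outlier and high leverage)
far = cooks_distances(x + [15], y + [20.0])

print(f"mid-range outlier: D = {mid[-1]:.2f}")
print(f"extreme-x outlier: D = {far[-1]:.2f}")
```

The pattern mirrors what the plots show: an outlier near the centre of the x range has little pull on the line, while an outlier with high leverage can dominate the fit.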

In [78]:
# Example 3
with_outlier_model = smf.ols("Salary ~ YoE", df_3).fit()
without_outlier_model = smf.ols("Salary ~ YoE", df_3_n).fit()

fig = go.Figure()
fig.add_trace(
    go.Scatter(x=df_3["YoE"], y=df_3["Salary"], mode="markers", name="Years of Experience vs Salary (x1000)"))
fig.add_trace(
    go.Scatter(x=df_3["YoE"], y=with_outlier_model.fittedvalues, name="Regression with outlier"))
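# The without-outlier model was fitted on one fewer row, so predict on df_3 to keep x and y aligned
fig.add_trace(
    go.Scatter(x=df_3["YoE"], y=without_outlier_model.predict(df_3), name="Regression without outlier"))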

fig.update_layout(title="Example 3: Fitted regression line with and without outlier")
fig.show()

Woah! Here we can barely tell the two regression lines apart. Looking at the numbers, we'll see that the quantities we check are also very similar between the two variants, especially the $R^2$. Thus, despite this point being a leverage, it is not an influential point either.

In [79]:
print("With outlier parameters: \n", with_outlier_model.params)
print("With outlier R2: \t", with_outlier_model.rsquared)

print("\n")
print("Without outlier parameters: \n", without_outlier_model.params)
print("Without outlier R2: \t", without_outlier_model.rsquared)


With outlier parameters: 
 Intercept    2.467879
YoE          4.927221
dtype: float64
With outlier R2: 	 0.9773929107826496


Without outlier parameters: 
 Intercept    1.732178
YoE          5.116869
dtype: float64
Without outlier R2: 	 0.9731681321750607


This concludes our short lecture on outliers! It is also worth noting that the fitted OLS results object has a method called [`.get_influence()`](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.get_influence.html#statsmodels.regression.linear_model.OLSResults.get_influence). This method gives us access to the [`OLSInfluence`](https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.OLSInfluence.html#statsmodels.stats.outliers_influence.OLSInfluence) class, which provides a ton of useful analysis tools if you want to dig deeper into outlier/leverage/influential point analysis. Take a look at the documentation and feel free to play around and Google concepts you don't understand or want to learn more about. Furthermore, there are many outlier detection and removal techniques out there, and once again, a Google search is a great start.
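As a starting point, here is a minimal sketch of `.get_influence()` in use. It runs on synthetic data (a linear trend plus one aberrant point, generated here as a stand-in for the notebook's files), and the column names in `diagnostics` are just labels chosen for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: a linear trend plus one aberrant point at (13, 15)
rng = np.random.default_rng(0)
yoe = np.linspace(0, 9, 20)
salary = 2 + 5 * yoe + rng.normal(0, 2, size=yoe.size)
df = pd.DataFrame({"YoE": np.append(yoe, 13.0), "Salary": np.append(salary, 15.0)})

model = smf.ols("Salary ~ YoE", df).fit()
influence = model.get_influence()

# Collect a few per-observation diagnostics from the OLSInfluence object
diagnostics = pd.DataFrame({
    "leverage": influence.hat_matrix_diag,
    "studentized_resid": influence.resid_studentized_external,
    "cooks_d": influence.cooks_distance[0],  # element [1] holds the matching p-values
})

# Show the observations with the largest Cook's distance first
print(diagnostics.sort_values("cooks_d", ascending=False).head())
```

Sorting by Cook's distance puts the most influential observations at the top, which gives you a short list of points to inspect rather than a rule for deleting them automatically.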