# EXTRA: Limitations of Linear Regression

The goal of this notebook is to highlight the limitations of Linear Regression.  
A linear regression model can only capture **linear relationships**.  
We use Anscombe's quartet, a famous collection of four datasets, to show that OLS can return nearly identical results for very different data.

__[Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe's_quartet)__

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns

%matplotlib inline

In [None]:
# Anscombe dataset includes 4 different data sets
anscombe = sns.load_dataset("anscombe")

In [None]:
# Let's see what the dataset looks like
anscombe.head(5)

In [None]:
# Get the descriptive statistics for each separate dataset by grouping by the 'dataset' column
anscombe.groupby('dataset').describe()

In [None]:
# Now we want to fit a model to each of the four datasets.
# Let's write a function instead of repeating the code four times.
import statsmodels.formula.api as smf

# Takes the dataset label as a string, fits an OLS model of y on x,
# and returns the intercept, slope and R-squared value
def get_summarystats(dataset):
    subset = anscombe[anscombe.dataset == dataset]
    res = smf.ols(formula='y ~ x', data=subset).fit()
    # res.params is ordered as [Intercept, x]
    intercept, slope = res.params
    r_squared = res.rsquared
    return intercept, slope, r_squared

In [None]:
# Let's call the function for every dataset in a for loop and collect the results in a list
results = []
for dataset in 'I II III IV'.split():
    # The third return value of get_summarystats is the R-squared of the fit
    intercept, slope, r_squared = get_summarystats(dataset)
    results.append([slope, intercept, r_squared])

print('[slope, intercept, r_squared] for each dataset')
results

Even though the descriptive statistics and the linear regression parameters are nearly the same, the datasets look distinctly different.
Let's have a look at the data and the linear regression lines.

In [None]:
# Show the results of a linear regression within each dataset
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=anscombe,
           col_wrap=2, ci=None, palette="muted", height=4, scatter_kws={"s": 50, "alpha": 1});

We can see from the above that the fitted lines, like the summary statistics, are nearly identical, although the underlying data are completely different.  
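
One way to see why the fitted lines coincide: in simple linear regression, the OLS slope and intercept depend only on the means of x and y, the variance of x, and the covariance of x and y. Datasets that share these moments therefore share the same line, whatever their shape. The sketch below illustrates this with NumPy only. It hard-codes the published quartet values instead of loading them through seaborn (an assumption made so that the block runs on its own), and the names `quartet` and `ols_from_moments` are purely illustrative.

```python
import numpy as np

# Anscombe's quartet, typed in by hand so the example needs no download.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def ols_from_moments(x, y):
    """Return (slope, intercept, r_squared) for y ~ x using only summary moments."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance of x and y
    slope = cov_xy / np.var(x, ddof=1)           # slope = cov(x, y) / var(x)
    intercept = y.mean() - slope * x.mean()      # line passes through the means
    r_squared = np.corrcoef(x, y)[0, 1] ** 2     # squared Pearson correlation
    return slope, intercept, r_squared

moment_fits = {name: ols_from_moments(x, y) for name, (x, y) in quartet.items()}
for name, (slope, intercept, r_squared) in moment_fits.items():
    print(f"{name}: slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

The printed values should agree closely with the statsmodels results collected in `results`, which is a useful sanity check on both computations.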

In this case, it is deliberately obvious that a straight line is a poor fit for datasets II, III and IV.  
In practice the problem is rarely this visible, so it is necessary to look at a plot of the residuals to catch these errors.
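
A plot is needed because simple numeric checks on the residuals can look perfectly healthy. With an intercept in the model, OLS residuals always have mean zero and zero correlation with x, by construction. The following minimal sketch, assuming the hard-coded values of dataset II and using `np.polyfit` in place of statsmodels, shows both quantities vanishing even though the residuals follow a clear curve. The squared-distance check at the end is only an illustrative stand-in for what the eye sees in the plot.

```python
import numpy as np

# Dataset II of Anscombe's quartet, hard-coded so the block is self-contained.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Fit a straight line; np.polyfit returns the coefficients highest degree first.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Both of these are (numerically) zero for any OLS fit with an intercept.
resid_mean = residuals.mean()
resid_x_corr = np.corrcoef(x, residuals)[0, 1]

# The curvature only shows up against a non-linear function of x,
# here the squared distance from the mean of x.
resid_curve_corr = np.corrcoef((x - x.mean()) ** 2, residuals)[0, 1]

print(f"mean of residuals:            {resid_mean:.2e}")
print(f"corr(residuals, x):           {resid_x_corr:.2e}")
print(f"corr(residuals, (x - mean)^2): {resid_curve_corr:.3f}")
```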

In [None]:
# Define a function which calculates the residuals and plots them
def get_residuals(dataset):
    subset = anscombe[anscombe.dataset == dataset]
    # get_summarystats returns intercept first, then slope, then R-squared
    intercept, slope, _ = get_summarystats(dataset)
    pred_values = slope * subset.x + intercept
    residuals = subset.y - pred_values
    # Plot residuals against x
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.scatter(subset.x, residuals, alpha=0.5)
    ax.set_ylabel("Residuals")
    ax.set_xlabel("x")
    fig.suptitle(f'Residual Scatter Plot (dataset {dataset})')
    plt.show()

In [None]:
# Use defined function to plot residuals for all datasets
for dataset in 'I II III IV'.split():
    get_residuals(dataset) 

## Summary
In a well-fitted model, the residuals are scattered randomly around zero, whereas in a poorly fitted model you will find patterns in the residual plot.  
These patterns tell you that the model is missing something, for example an additional explanatory variable or a non-linear term, or that single points such as outliers dominate the fit.  
Further reading on residuals: [Statology](https://www.statology.org/residuals/) and [Towards Data Science](https://towardsdatascience.com/how-to-use-residual-plots-for-regression-model-validation-c3c70e8ab378).
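
As an illustration of a missing explanatory term, the curved residual pattern of dataset II suggests adding a quadratic term. The sketch below is one possible remedy under that assumption, not a general recipe: it compares a degree-1 and a degree-2 polynomial fit with `np.polyfit` on hard-coded dataset II values. The helper `r_squared` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Dataset II of Anscombe's quartet, hard-coded so the block is self-contained.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Straight line versus a model with an additional x^2 term.
linear_pred = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_pred = np.polyval(np.polyfit(x, y, deg=2), x)

r2_linear = r_squared(y, linear_pred)
r2_quadratic = r_squared(y, quadratic_pred)

print(f"R^2 linear:    {r2_linear:.3f}")
print(f"R^2 quadratic: {r2_quadratic:.3f}")
```

The same idea does not carry over to datasets III and IV, where the issue is a single outlying or high-leverage point and not a missing term.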