# Regression diagnostics with statsmodels

_Author: Christoph Rahmede_

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-1">Load the data</a></span></li><li><span><a href="#Fitting-with-statsmodels" data-toc-modified-id="Fitting-with-statsmodels-2">Fitting with statsmodels</a></span></li><li><span><a href="#Fit-plot" data-toc-modified-id="Fit-plot-3">Fit plot</a></span></li><li><span><a href="#Partial-regression-plot" data-toc-modified-id="Partial-regression-plot-4">Partial regression plot</a></span></li><li><span><a href="#CCPR-plot" data-toc-modified-id="CCPR-plot-5">CCPR plot</a></span></li><li><span><a href="#Leverage-versus-normalized-residuals" data-toc-modified-id="Leverage-versus-normalized-residuals-6">Leverage versus normalized residuals</a></span></li><li><span><a href="#Influence-plot" data-toc-modified-id="Influence-plot-7">Influence plot</a></span></li><li><span><a href="#Cooks-distance" data-toc-modified-id="Cooks-distance-8">Cooks distance</a></span></li><li><span><a href="#Model-with-outliers-removed" data-toc-modified-id="Model-with-outliers-removed-9">Model with outliers removed</a></span></li><li><span><a href="#Variance-inflation-factor" data-toc-modified-id="Variance-inflation-factor-10">Variance inflation factor</a></span></li></ul></div>

In [1]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

warnings.simplefilter('ignore')

## Load the data

In [2]:
# load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston

In [3]:
data = load_boston()
print(data.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [4]:
from sklearn.preprocessing import StandardScaler

In [5]:
scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(data.data), columns=data.feature_names)
df['target'] = data.target
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
target     506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB


## Fitting with statsmodels

We can fit the same kind of linear regression model with statsmodels. We can work directly with the data matrices, but setting up the model with a formula is also very appealing. The fitted coefficients will be exactly the same, but statsmodels reports additional information such as standard errors, confidence intervals and test statistics.

In [6]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [7]:
predictors = [col for col in df.columns if col != 'target']

In [8]:
# results = ...

## Fit plot

## Partial regression plot

## CCPR plot

## Leverage versus normalized residuals

## Influence plot

## Cooks distance
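
Cook's distance measures how much the fitted values change when a single observation is left out, so it combines the size of the residual with the leverage. A minimal sketch on synthetic data (hypothetical column `x`, one planted outlier) reads the distances from the influence object of a fitted model; `cooks_distance` returns a tuple of distances and p-values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a made-up column name.
rng = np.random.default_rng(2)
demo = pd.DataFrame({"x": rng.normal(size=100)})
demo["target"] = 1.0 + 2.0 * demo["x"] + rng.normal(scale=0.5, size=100)
# Plant one point that is extreme in x and far off the line.
demo.loc[0, ["x", "target"]] = [6.0, -10.0]

results = smf.ols("target ~ x", data=demo).fit()

# One Cook's distance (and p-value) per observation.
cooks_d, p_values = results.get_influence().cooks_distance
most_influential = pd.Series(cooks_d, index=demo.index).sort_values(ascending=False)
print(most_influential.head())
```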

## Model with outliers removed
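
One possible way to refit without influential points is sketched below on synthetic data with a hypothetical column `x` and one planted outlier. The cutoff of 4/n on Cook's distance is a common rule of thumb and an assumption here; the notebook does not state which criterion it uses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a made-up column name; the true slope is 2.
rng = np.random.default_rng(4)
demo = pd.DataFrame({"x": rng.normal(size=100)})
demo["target"] = 1.0 + 2.0 * demo["x"] + rng.normal(scale=0.5, size=100)
demo.loc[0, ["x", "target"]] = [6.0, -10.0]  # planted outlier

results = smf.ols("target ~ x", data=demo).fit()
cooks_d = results.get_influence().cooks_distance[0]

# Rule-of-thumb cutoff 4/n (an assumption, not taken from the notebook).
keep = cooks_d < 4 / len(demo)
results_clean = smf.ols("target ~ x", data=demo[keep]).fit()

# Compare the slope before and after dropping the flagged points.
print(results.params["x"], results_clean.params["x"])
```

Dropping observations changes the question the model answers, so the flagged rows are worth inspecting before they are discarded.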

## Variance inflation factor

In [9]:
from statsmodels.stats.outliers_influence import variance_inflation_factor