Cook's Distance
===============

Cook's Distance is a measure of an observation's (or instance's) influence on
a linear regression. Instances with a large influence may be outliers, and
datasets with a large number of highly influential points might not be
suitable for linear regression without further processing such as outlier
removal or imputation. The ``CooksDistance`` visualizer shows a stem plot of
all instances by index and their associated distance score, along with a
heuristic threshold to quickly show what percent of the dataset may be
impacting OLS regression models.
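
For reference, the usual formulation of Cook's Distance for the
:math:`i`-th observation is:

.. math::

    D_i = \frac{e_i^2}{p \, \hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}

where :math:`e_i` is the residual of the :math:`i`-th observation,
:math:`h_{ii}` is its leverage (the :math:`i`-th diagonal entry of the hat
matrix), :math:`p` is the number of model parameters, and
:math:`\hat{\sigma}^2` is the mean squared error of the regression. A common
heuristic, used by the visualizer as its ``influence_threshold_``, flags
points with :math:`D_i > 4/n` for :math:`n` observations.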

=================   ==============================
Visualizer          :class:`~yellowbrick.regressor.influence.CooksDistance`
Quick Method        :func:`~yellowbrick.regressor.influence.cooks_distance`
Workflow            Dataset/Sensitivity Analysis
=================   ==============================

.. plot::
    :context: close-figs
    :alt: Cook's Distance using concrete dataset

    from yellowbrick.datasets import load_concrete
    from yellowbrick.regressor import CooksDistance

    # Load the regression dataset
    X, y = load_concrete()

    # Instantiate and fit the visualizer
    visualizer = CooksDistance()
    visualizer.fit(X, y)
    visualizer.show()

The presence of so many highly influential points suggests that linear
regression may not be suitable for this dataset. One or more of the four
assumptions behind linear regression may be violated: independence of
observations, linearity of response, normality of residuals, or homogeneity
of variance ("homoscedasticity"). We can check the latter three conditions
using a residual plot:

.. plot::
    :context: close-figs
    :alt: Residual plot using concrete dataset

    from sklearn.linear_model import LinearRegression
    from yellowbrick.regressor import ResidualsPlot

    # Instantiate and fit the visualizer
    model = LinearRegression()
    visualizer_residuals = ResidualsPlot(model)
    visualizer_residuals.fit(X, y)
    visualizer_residuals.show()

The residuals appear to be normally distributed around 0, satisfying the
linearity and normality conditions. However, they skew slightly positive for
larger predicted values and appear to grow in magnitude as the predicted
value increases, suggesting a violation of the homoscedasticity condition.
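
As a complement to the visual check, a formal test for heteroscedasticity can
be run on the residuals. The following is a minimal sketch, assuming
``statsmodels`` is installed (it is not used elsewhere in these docs); the
Breusch-Pagan test regresses the squared residuals on the predictors, so a
small p-value suggests the error variance is not constant:

.. code:: python

    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan
    from sklearn.linear_model import LinearRegression
    from yellowbrick.datasets import load_concrete

    # Fit a plain linear regression and compute its residuals
    X, y = load_concrete()
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)

    # The test expects an exogenous matrix that includes an intercept column
    exog = sm.add_constant(X)
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, exog)
    print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")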

Given this information, we might consider one of the following options: (1)
using the linear regression anyway, (2) using linear regression after removing
outliers, or (3) resorting to other regression models. For the sake of
illustration, we will go with option (2) with the help of the visualizer's
public learned parameters ``distance_`` and ``influence_threshold_``:

.. plot::
    :context: close-figs
    :alt: Residual plot using concrete dataset after outlier removal

    # Keep only the instances at or below the influence threshold
    i_less_influential = (visualizer.distance_ <= visualizer.influence_threshold_)
    X_li, y_li = X[i_less_influential], y[i_less_influential]

    # Refit the model on the less influential subset of the data
    model = LinearRegression()
    visualizer_residuals = ResidualsPlot(model)
    visualizer_residuals.fit(X_li, y_li)
    visualizer_residuals.show()

The violations of the linear regression assumptions addressed earlier appear
to be diminished. The goodness-of-fit measure (:math:`R^2`) has increased
from 0.615 to 0.748, which is to be expected, as there is less variance in
the response variable after outlier removal.
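
To double-check those numbers outside of the plots, here is a quick sketch
(assuming ``X``, ``y``, ``X_li``, and ``y_li`` from the snippets above) that
compares the training :math:`R^2` of the two fits; the values should roughly
match those reported by ``ResidualsPlot``:

.. code:: python

    from sklearn.linear_model import LinearRegression

    # R^2 of a fit on the full dataset
    r2_full = LinearRegression().fit(X, y).score(X, y)

    # R^2 of a fit after dropping the highly influential instances
    r2_trimmed = LinearRegression().fit(X_li, y_li).score(X_li, y_li)

    print(f"full: {r2_full:.3f}, trimmed: {r2_trimmed:.3f}")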

Quick Method
------------

Similar functionality as above can be achieved in one line using the
associated quick method, ``cooks_distance``. This method will instantiate a
``CooksDistance`` visualizer, fit it on the given data, and draw the
resulting stem plot.

.. plot::
    :context: close-figs
    :alt: Cook's Distance quick method using concrete dataset

    from yellowbrick.datasets import load_concrete
    from yellowbrick.regressor import cooks_distance

    # Load the regression dataset
    X, y = load_concrete()

    # Instantiate and fit the visualizer with the quick method
    cooks_distance(X, y)


API Reference
-------------

.. automodule:: yellowbrick.regressor.influence
    :members: CooksDistance, cooks_distance
    :undoc-members:
    :show-inheritance:
