# Detecting Outliers

Detecting outliers is more of an art than a science. A common method is to assume
the data is normally distributed and, based on that assumption, “draw” an ellipse
around the data, classifying any observation inside the ellipse as an inlier (labeled
as 1) and any observation outside the ellipse as an outlier (labeled as -1):

In [6]:
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

In [7]:
# Create simulated data
features, _ = make_blobs(n_samples=10,
                         n_features=2,
                         centers=1,
                         random_state=1)

features

array([[-1.83198811,  3.52863145],
       [-2.76017908,  5.55121358],
       [-1.61734616,  4.98930508],
       [-0.52579046,  3.3065986 ],
       [ 0.08525186,  3.64528297],
       [-0.79415228,  2.10495117],
       [-1.34052081,  4.15711949],
       [-1.98197711,  4.02243551],
       [-2.18773166,  3.33352125],
       [-0.19745197,  2.34634916]])

In [8]:
# Replace the first observation's values with extreme values
features[0, 0] = 10000
features[0, 1] = 10000

In [9]:
features

array([[ 1.00000000e+04,  1.00000000e+04],
       [-2.76017908e+00,  5.55121358e+00],
       [-1.61734616e+00,  4.98930508e+00],
       [-5.25790464e-01,  3.30659860e+00],
       [ 8.52518583e-02,  3.64528297e+00],
       [-7.94152277e-01,  2.10495117e+00],
       [-1.34052081e+00,  4.15711949e+00],
       [-1.98197711e+00,  4.02243551e+00],
       [-2.18773166e+00,  3.33352125e+00],
       [-1.97451969e-01,  2.34634916e+00]])

In [10]:
# Create detector
outlier_detector = EllipticEnvelope(contamination=0.1)

In [11]:
# Fit detector
outlier_detector.fit(features)

EllipticEnvelope()

In [12]:
# Predict outliers (-1) and inliers (1)
outlier_detector.predict(features)

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])
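
To make the “ellipse” idea concrete, the sketch below scores each observation by its
squared Mahalanobis distance from the center of the data and flags the most distant
share. It is a simplified illustration, not a reproduction of what EllipticEnvelope
does internally: it uses the plain sample mean and covariance rather than a robust
estimate, and the data, the planted outlier, and the helper name
`flag_by_mahalanobis` are assumptions made for the example.

```python
import numpy as np

def flag_by_mahalanobis(X, contamination=0.1):
    """Label the most distant `contamination` share of rows -1, the rest 1."""
    center = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - center
    # Squared Mahalanobis distance of each row from the center
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    # Rows beyond this distance are treated as lying outside the ellipse
    threshold = np.quantile(d2, 1 - contamination)
    return np.where(d2 > threshold, -1, 1)

# Hypothetical data: 50 standard-normal points with one planted outlier in row 0
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[0] = [10, 10]

labels = flag_by_mahalanobis(X, contamination=0.02)
```

Because the ordinary mean and covariance are themselves pulled toward extreme
points, a robust estimate is generally the safer choice on contaminated data.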

A major limitation of this approach is the need to specify a contamination parameter,
which is the proportion of observations that are outliers, a value that we don’t
know. Think of contamination as our estimate of the cleanliness of our data. If we
expect our data to have few outliers, we can set contamination to something small.
However, if we believe that the data is very likely to have outliers, we can set it to a
higher value.
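
One way to see how much this choice matters is to fit the same detector with several
contamination values and count how many observations each one flags. The following
is a minimal sketch under assumed settings: the sample size, the three contamination
values, and the `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Hypothetical data: a single blob with no deliberately planted outliers
X, _ = make_blobs(n_samples=200, n_features=2, centers=1, random_state=1)

flagged_counts = {}
for contamination in (0.01, 0.05, 0.2):
    detector = EllipticEnvelope(contamination=contamination, random_state=0)
    labels = detector.fit_predict(X)
    # Count the observations labeled as outliers (-1)
    flagged_counts[contamination] = int((labels == -1).sum())

print(flagged_counts)
```

The detector flags roughly the requested share of the training data whether or not
those observations are truly anomalous, which is why the parameter should reflect
what we actually believe about the data.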


Instead of looking at observations as a whole, we can look at individual features
and identify extreme values in those features using the interquartile range (IQR):

In [13]:
# Create one feature
feature = features[:,0]
feature

array([ 1.00000000e+04, -2.76017908e+00, -1.61734616e+00, -5.25790464e-01,
        8.52518583e-02, -7.94152277e-01, -1.34052081e+00, -1.98197711e+00,
       -2.18773166e+00, -1.97451969e-01])

In [14]:
# Create a function to return the indices of outliers
def indices_of_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((x > upper_bound) | (x < lower_bound))

In [15]:
# Run function
indices_of_outliers(feature)

(array([0], dtype=int64),)

IQR is the difference between the third and first quartiles of a set of data. You can
think of IQR as the spread of the bulk of the data, with outliers being observations far
from the main concentration of data. Outliers are commonly defined as any value more
than 1.5 IQRs below the first quartile or more than 1.5 IQRs above the third quartile.
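
The arithmetic behind that rule is easy to check by hand. The short example below
works through it on a small made-up feature (the nine values are an assumption chosen
so the quartiles come out as whole numbers):

```python
import numpy as np

# Hypothetical feature with one obviously extreme value
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = np.percentile(x, [25, 75])   # 3.0 and 7.0
iqr = q3 - q1                         # 4.0
lower_bound = q1 - 1.5 * iqr          # -3.0
upper_bound = q3 + 1.5 * iqr          # 13.0

# Values outside [lower_bound, upper_bound] are treated as outliers
outliers = x[(x < lower_bound) | (x > upper_bound)]   # array([100])
```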

There is no single best technique for detecting outliers. Instead, we have a collection
of techniques, all with their own advantages and disadvantages. Our best strategy is
often to try multiple techniques (e.g., both EllipticEnvelope and IQR-based
detection) and look at the results as a whole.
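
As a rough illustration of looking at the results as a whole, the sketch below runs
both techniques on the same data and counts how many of them flag each observation.
The data, the planted outlier, and the simple vote count are assumptions made for
the example; other ways of combining detectors are equally reasonable.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Hypothetical data: one blob with a planted extreme observation in row 0
X, _ = make_blobs(n_samples=50, n_features=2, centers=1, random_state=1)
X[0] = [10000, 10000]

# Technique 1: elliptic envelope fit on whole observations
detector = EllipticEnvelope(contamination=0.1, random_state=0)
envelope_flags = detector.fit_predict(X) == -1

# Technique 2: IQR rule per feature; a row is flagged if any feature is extreme
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
iqr_flags = ((X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)).any(axis=1)

# Number of techniques that flagged each row (0, 1, or 2)
votes = envelope_flags.astype(int) + iqr_flags.astype(int)
agreed = np.where(votes == 2)[0]
```

Rows flagged by both techniques are the strongest candidates for a closer look,
while rows flagged by only one deserve more skepticism.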