# Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [10]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [11]:
np.random.seed(12345)

In [12]:
data = DataFrame(np.random.randn(1000, 4))

data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067684     0.067924     0.025598    -0.002298
std       0.998035     0.992106     1.006835     0.996794
min      -3.428254    -3.548824    -3.184377    -3.745356
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.366626     2.653656     3.260383     3.927528


Suppose you wanted to find values in one of the columns exceeding three in magnitude:

In [13]:
col4 = data[3]

col4

0     -0.555730
1      0.281746
2     -1.296221
3      0.886429
4     -0.438570
         ...   
995    1.653126
996   -1.502870
997    0.133304
998   -0.303340
999    0.882755
Name: 3, Length: 1000, dtype: float64

In [14]:
col4[np.abs(col4) > 3]

97     3.927528
305   -3.399312
400   -3.745356
Name: 3, dtype: float64
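
The same boolean-mask pattern can be checked by eye on a tiny Series. This is only an illustrative sketch: the values and the names `s`, `mask`, and `outliers` are made up for the example and are not part of the dataset above.

```python
import pandas as pd

# Small hand-made Series (hypothetical values) so the result is easy to verify
s = pd.Series([0.5, -3.2, 2.9, 3.7, -0.1])

# Boolean Series: True wherever the absolute value exceeds 3
mask = s.abs() > 3

# Indexing with the mask keeps only the matching rows and their original labels
outliers = s[mask]

print(mask.tolist())            # [False, True, False, True, False]
print(outliers.index.tolist())  # [1, 3]
```

Note that `s.abs()` and `np.abs(s)` give the same result for a Series; which one to use is a matter of taste.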

To select all rows containing at least one value greater than 3 or less than -3, you can use the any method on a boolean DataFrame:

In [16]:
data[(np.abs(data) > 3).any(axis=1)]

            0         1         2         3
5   -0.539741  0.476985  3.248944 -1.021228
97  -0.774363  0.552936  0.106061  3.927528
102 -0.655054 -0.565230  3.176873  0.959533
305 -2.315555  0.457246 -0.025907 -3.399312
324  0.050188  1.951312  3.260383  0.963301
400  0.146326  0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990  1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586  0.275144  1.179227 -3.184377  1.369891
808 -0.362528 -3.548824  1.553205 -2.186301
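
To make the row-wise reduction concrete, here is a minimal sketch on a three-row DataFrame. The frame `df` and its column names `a` and `b` are hypothetical, chosen only so the result can be verified by hand.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 each contain one value beyond +/-3
df = pd.DataFrame({"a": [0.1, 3.5, -0.2], "b": [-4.0, 0.3, 1.0]})

# Element-wise test: a boolean DataFrame with the same shape as df
mask = df.abs() > 3

# axis=1 reduces across the columns, giving one True/False per row
row_has_outlier = mask.any(axis=1)
selected = df[row_has_outlier]

# Summing the mask down the rows counts the outliers in each column
per_column = mask.sum()

print(row_has_outlier.tolist())  # [True, True, False]
print(selected.index.tolist())   # [0, 1]
print(per_column.to_dict())      # {'a': 1, 'b': 1}
```

Passing `axis=1` by keyword is the safer spelling, since recent pandas versions no longer accept the axis as a positional argument to `any`.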


Values can just as easily be set based on these criteria. Here is code to cap values outside the interval -3 to 3:


In [18]:
data[np.abs(data) > 3] = np.sign(data) * 3

data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067623     0.068473     0.025153    -0.002081
std       0.995485     0.990253     1.003977     0.989736
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.000000     2.653656     3.000000     3.000000


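The ufunc np.sign returns an array of 1 and -1 depending on the sign of the values (and 0 where a value is exactly zero), so np.sign(data) * 3 supplies -3 or 3 for each value that falls outside the interval.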
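
As a small sketch of how the capping works, the example below applies the sign-based assignment to a toy frame and compares it with pandas' built-in `clip` method, which is swapped in here as an alternative and is not the approach used above. The frame `df` and the names `capped` and `clipped` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few values outside [-3, 3]
df = pd.DataFrame({"a": [-4.5, 0.0, 2.0], "b": [3.2, -1.0, 7.0]})

# np.sign maps negatives to -1, zero to 0, and positives to 1
print(np.sign(df["a"]).tolist())  # [-1.0, 0.0, 1.0]

# Sign-based capping: overwrite only the out-of-range entries
capped = df.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# Alternative: clip bounds every value to the closed interval [-3, 3]
clipped = df.clip(lower=-3, upper=3)

print(capped["a"].tolist())  # [-3.0, 0.0, 2.0]
print(capped["b"].tolist())  # [3.0, -1.0, 3.0]
print(np.allclose(capped.to_numpy(), clipped.to_numpy()))  # True
```

For a symmetric bound like this one, the two approaches should give the same result; the mask-and-assign form is more flexible when the replacement value depends on something other than a fixed limit.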