# Statistical Outlier Detection
In statistics, if a data distribution is approximately normal, we can use the mean and standard deviation to estimate the probability that a data point falls within a certain range:
*   About 68% of the data falls within one standard deviation of the mean
*   About 95% of the data falls within two standard deviations of the mean
*   About 99.7% of the data falls within three standard deviations of the mean
Thus, we can use the mean +/- three standard deviations as the boundary of normal data. Any data point that falls outside this boundary is considered an outlier.
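
The percentages above can be checked empirically. The following is a minimal sketch on synthetic data, not on the temperature file: the sample, its parameters (`loc=50.0`, `scale=10.0`), and the helper `fraction_within` are assumptions made up for illustration.

```python
import numpy as np

# Synthetic, normally distributed sample (illustrative data, not the temperature file).
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=10.0, size=100_000)

mean = sample.mean()
std = sample.std(ddof=1)  # sample standard deviation, the same default pandas uses


def fraction_within(k):
    """Return the fraction of points within k standard deviations of the mean."""
    return float(np.mean(np.abs(sample - mean) <= k * std))


# Hypothetical helper applied for one, two, and three standard deviations.
coverage = {k: fraction_within(k) for k in (1, 2, 3)}
for k, frac in coverage.items():
    print(f"within {k} std: {frac:.3f}")
```

With a sample this large, the printed fractions should land close to 68%, 95%, and 99.7%. On real data that is skewed or heavy-tailed, the fractions can differ noticeably, which is why the rule is stated only for approximately normal distributions.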

## Dependencies

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Read data

In [2]:
df = pd.read_csv('/work/Nov2Temp.csv')
df

Unnamed: 0,high,low
0,58,25
1,26,11
2,53,24
3,60,37
4,67,42
...,...,...
115,99,33
116,99,27
117,18,38
118,15,51


## Drop the missing values

In [3]:
# Readings far below any plausible temperature (here -998) mark missing values
df[df['low'] < -100]

Unnamed: 0,high,low
72,-998,-998
79,-998,-998


In [4]:
# Drop the two placeholder rows by their index labels
df.drop([72, 79], inplace=True)
df

Unnamed: 0,high,low
0,58,25
1,26,11
2,53,24
3,60,37
4,67,42
...,...,...
115,99,33
116,99,27
117,18,38
118,15,51
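
Dropping rows by hard-coded labels works for this file, but the labels have to be looked up again whenever the data changes. As a hedged alternative, the sketch below filters on the placeholder value itself. It assumes that -998 is the only missing-value code, which is inferred from the two rows shown earlier; the `demo` frame and the `SENTINEL` name are made up for illustration.

```python
import pandas as pd

# Small stand-in frame; the values are illustrative, not read from Nov2Temp.csv.
demo = pd.DataFrame({"high": [58, -998, 53, 60], "low": [25, -998, 24, 37]})

SENTINEL = -998  # assumed missing-value code, based on the rows shown earlier

# Keep only the rows in which no column holds the sentinel value.
cleaned = demo[(demo != SENTINEL).all(axis=1)]
print(cleaned)
```

Unlike `drop(..., inplace=True)`, this returns a new frame and leaves the original untouched, which makes the cell safe to re-run.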


## Run the detection

In [5]:
# Flag rows where either column lies more than three standard deviations from its mean
df[(df['high'] < (df['high'].mean() - 3 * df['high'].std())) |
   (df['high'] > (df['high'].mean() + 3 * df['high'].std())) |
   (df['low'] < (df['low'].mean() - 3 * df['low'].std())) |
   (df['low'] > (df['low'].mean() + 3 * df['low'].std()))]

Unnamed: 0,high,low
111,48,99
112,43,99
113,64,99
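
The filter used for the detection repeats the same bound calculation for every column. One possible way to factor it out is sketched below. The function name `three_sigma_outliers`, its parameter `k`, and the `demo` data are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd


def three_sigma_outliers(frame, columns, k=3.0):
    """Return rows where any listed column lies more than k std from its mean."""
    subset = frame[columns]
    mean = subset.mean()
    std = subset.std()  # pandas uses the sample standard deviation (ddof=1)
    outside = (subset < mean - k * std) | (subset > mean + k * std)
    return frame[outside.any(axis=1)]


# Illustrative data: 20 ordinary readings plus one extreme 'high' value.
demo = pd.DataFrame({
    "high": [50, 52, 48, 51, 49] * 4 + [500],
    "low": [30, 31, 29, 32, 28] * 4 + [30],
})

flagged = three_sigma_outliers(demo, ["high", "low"])
print(flagged)
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so in a very small sample even a large deviation may not exceed three standard deviations.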
