### What is an outlier?
An outlier is a data point in a dataset that lies far from the other observations, outside the overall distribution of the data.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#### Criteria to identify an outlier:
1. A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
2. A data point whose z-score has an absolute value above a chosen threshold, meaning it lies more than that many standard deviations from the mean. A threshold of 3 is common; 2 is a stricter alternative.

#### Reasons for an outlier to exist:
1. Variability in the data
2. An experimental measurement error

#### Impacts of having outliers:
1. They can distort statistical analysis and lead to misleading conclusions.
2. They can shift the mean and inflate the standard deviation significantly.
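To make the effect on the mean and standard deviation concrete, here is a minimal sketch. The sample values and the helper name `summarize` are hypothetical, chosen only for illustration; a single extreme reading is appended to ten typical ones.

```python
import numpy as np

# Hypothetical sample: ten typical readings, then the same readings plus one extreme value.
clean = [11, 10, 12, 14, 12, 15, 14, 13, 15, 12]
with_outlier = clean + [102]


def summarize(values):
    """Return (mean, population standard deviation) of the values."""
    arr = np.asarray(values, dtype=float)
    return arr.mean(), arr.std()


clean_mean, clean_std = summarize(clean)
out_mean, out_std = summarize(with_outlier)

print(f"without outlier: mean={clean_mean:.2f}, std={clean_std:.2f}")
print(f"with outlier:    mean={out_mean:.2f}, std={out_std:.2f}")
```

One added value is enough to pull the mean well above every typical reading and to make the standard deviation many times larger.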

#### Various ways of finding the outlier:
1. Scatter plot
2. Box plot
3. Z-score
4. IQR
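As a quick visual check, a box plot marks points beyond its whiskers individually. The sketch below assumes Matplotlib's default whisker length of 1.5 times the IQR and repeats the notebook's dataset so that the cell runs on its own; the variable name `flier_values` is introduced here for illustration.

```python
import matplotlib.pyplot as plt

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13,
           12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

fig, ax = plt.subplots(figsize=(3, 5))
# whis=1.5 (the default) extends the whiskers up to 1.5 * IQR beyond the quartiles.
box = ax.boxplot(dataset, whis=1.5)
ax.set_ylabel("value")

# Points beyond the whiskers are drawn as separate "flier" markers.
flier_values = sorted(float(v) for v in box["fliers"][0].get_ydata())
print(flier_values)
```

In a notebook with `%matplotlib inline`, the figure renders below the cell, with the flagged points drawn as isolated markers above the box.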

In [7]:
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

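#### Detecting outliers using the Z-score
The Z-score measures how many standard deviations an observation lies from the mean:

Z-score = (Observation - Mean) / (Standard Deviation)

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the observation, $\mu$ is the mean, and $\sigma$ is the standard deviation.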


In [9]:
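def detect_outliers(data, threshold=3):
    """Return the values in data whose absolute z-score exceeds threshold."""
    outliers = []  # local list, so repeated calls do not accumulate results
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)
    for x in data:
        z_score = (x - mean) / std
        if np.abs(z_score) > threshold:
            # x lies more than `threshold` standard deviations from the mean
            outliers.append(x)
    return outliers

In [10]:
outlier_pt = detect_outliers(dataset)
print(outlier_pt)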

[102, 107, 108]
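The same z-score check can also be written without an explicit Python loop. This is a minimal sketch rather than the notebook's own method: the function name `zscore_outliers` is hypothetical, and it assumes the population standard deviation (NumPy's default, `ddof=0`). The dataset is repeated so the cell runs on its own.

```python
import numpy as np

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13,
           12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]


def zscore_outliers(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds threshold."""
    arr = np.asarray(data, dtype=float)
    std = arr.std()  # population standard deviation (ddof=0)
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    z = (arr - arr.mean()) / std
    return arr[np.abs(z) > threshold].tolist()


vectorized_outliers = zscore_outliers(dataset)
print(vectorized_outliers)
```

Because the threshold is a parameter, the stricter cut-off of 2 standard deviations can be tried with `zscore_outliers(dataset, threshold=2)`.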


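#### Interquartile Range (IQR)
The IQR is the difference between the 75th and 25th percentiles of a dataset, i.e. the spread of the middle 50% of the values.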

##### Steps
1. Arrange the data in increasing order
2. Calculate the first quartile (q1) and the third quartile (q3)
3. Find the interquartile range: IQR = q3 - q1
4. Find the lower bound: q1 - 1.5 × IQR
5. Find the upper bound: q3 + 1.5 × IQR
* Anything that lies below the lower bound or above the upper bound is an outlier
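The whole procedure can be sketched as one function, using the conventional bounds of q1 - 1.5 × IQR and q3 + 1.5 × IQR. The function name `iqr_outliers` and the parameter `k` are hypothetical, and quartiles are assumed to come from `np.percentile` with its default linear interpolation. The dataset is repeated so the cell runs on its own.

```python
import numpy as np

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13,
           12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]


def iqr_outliers(data, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    values = np.sort(np.asarray(data, dtype=float))  # step 1: sort ascending
    q1, q3 = np.percentile(values, [25, 75])          # step 2: quartiles
    iqr = q3 - q1                                     # step 3: interquartile range
    lower = q1 - k * iqr                              # step 4: lower bound
    upper = q3 + k * iqr                              # step 5: upper bound
    outliers = values[(values < lower) | (values > upper)]
    return float(lower), float(upper), outliers.tolist()


lower_bound, upper_bound, iqr_outlier_values = iqr_outliers(dataset)
print(lower_bound, upper_bound, iqr_outlier_values)
```

Sorting is not strictly needed for `np.percentile`, which handles unsorted input; it is kept here only to mirror the listed steps and to return the outliers in increasing order.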


In [12]:
quartile1, quartile3 = np.percentile(dataset,[25,75])
print(quartile1,quartile3)

12.0 15.0


In [21]:
# Find the IQR
iqr_value = quartile3 - quartile1
print(iqr_value)

3.0


In [None]:
lower_bound = quartile1 - 1.5*(