# What is an outlier?
An outlier is a data point that lies far from the other observations in a dataset, that is, a point that falls outside the overall distribution of the data.

# What are the criteria to identify an outlier?
1. IQR - A data point that falls more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile
2. z-score - A data point whose z-score has an absolute value greater than 3, i.e. it lies more than 3 standard deviations from the mean. A stricter threshold of 2 standard deviations is sometimes used instead

# What are the reasons for an outlier to exist in a dataset?
1. Variability in the data
2. An experimental measurement error


# What are the impacts of having outliers in a dataset?
1. They can distort statistical analysis, since many summary statistics and tests are sensitive to extreme values
2. They can significantly shift the mean and inflate the standard deviation
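
To make the second point concrete, here is a minimal sketch that compares the mean and standard deviation of this notebook's sample dataset with and without its three large values. The cutoff of 50 used to separate them is an assumption chosen only for this illustration, not a general rule.

```python
import numpy as np

# Same sample data used in the rest of this notebook
dataset = [11, 13, 14, 16, 13, 12, 17, 18, 19, 11, 12, 14, 17, 19, 12, 103, 100,
           109, 12, 18, 17, 16, 15, 14, 13, 12, 10, 17, 15, 14, 12, 11, 18, 10]

# Assumption for illustration only: anything above 50 is treated as an outlier
clean = [x for x in dataset if x <= 50]

mean_all, std_all = np.mean(dataset), np.std(dataset)
mean_clean, std_clean = np.mean(clean), np.std(clean)

print("with outliers:    mean=%.2f std=%.2f" % (mean_all, std_all))
print("without outliers: mean=%.2f std=%.2f" % (mean_clean, std_clean))
```

Three extreme values out of 34 are enough to pull the mean well above the typical value and to make the standard deviation several times larger.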

# Various ways of finding outliers
1. Using scatter plots
2. Using box plots
3. Using the z-score
4. Using the IQR (interquartile range)
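
The z-score and IQR methods are worked through later in this notebook; the box plot is not, so here is a small sketch of it. It assumes matplotlib's `boxplot` with `whis=1.5` (its default), which draws points beyond 1.5 times the IQR from the box as individual "fliers". Reading the flier values back from the returned dictionary is just one convenient way to inspect them.

```python
import matplotlib.pyplot as plt

# Same sample data used in the rest of this notebook
dataset = [11, 13, 14, 16, 13, 12, 17, 18, 19, 11, 12, 14, 17, 19, 12, 103, 100,
           109, 12, 18, 17, 16, 15, 14, 13, 12, 10, 17, 15, 14, 12, 11, 18, 10]

fig, ax = plt.subplots()
# whis=1.5: whiskers reach the most extreme points within 1.5 * IQR of the box
result = ax.boxplot(dataset, whis=1.5)
ax.set_ylabel("value")
ax.set_title("Box plot of the sample dataset")

# Points plotted beyond the whiskers are the candidate outliers
fliers = sorted(result["fliers"][0].get_ydata())
print(fliers)
```

In a notebook with `%matplotlib inline`, the figure renders below the cell and the isolated markers above the upper whisker are the suspected outliers.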

In [3]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [31]:
dataset=[11,13,14,16,13,12,17,18,19,11,12,14,17,19,12,103,100,109,12,18,17,16,15,14,13,12,10,17,15,14,12,11,18,10]

# 1) Detecting outliers using the z-score
z = (Observation - Mean) / Standard Deviation = (X - μ) / σ

In [32]:
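def detect_outliers(data, threshold=3):
    # Collect results in a local list so repeated calls do not accumulate values
    outliers = []
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)

    for value in data:
        z = (value - mean) / std
        # Flag points more than `threshold` standard deviations from the mean
        if np.abs(z) > threshold:
            outliers.append(value)

    return outliers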

In [33]:
x=detect_outliers(dataset)

In [34]:
x

[103, 100, 109]
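
The same check can also be written without an explicit loop. The sketch below is one possible vectorized variant; the name `detect_outliers_vectorized` is hypothetical, and it assumes the data has a non-zero standard deviation. Like `np.std` in `detect_outliers`, it uses the population standard deviation (`ddof=0`).

```python
import numpy as np

# Same sample data used in the rest of this notebook
dataset = [11, 13, 14, 16, 13, 12, 17, 18, 19, 11, 12, 14, 17, 19, 12, 103, 100,
           109, 12, 18, 17, 16, 15, 14, 13, 12, 10, 17, 15, 14, 12, 11, 18, 10]

def detect_outliers_vectorized(data, threshold=3):
    """Return the values whose absolute z-score exceeds `threshold`."""
    values = np.asarray(data, dtype=float)
    # z-score of every point at once; std() defaults to ddof=0
    z_scores = (values - values.mean()) / values.std()
    # Boolean mask keeps only the points beyond the threshold
    return values[np.abs(z_scores) > threshold].tolist()

z_outliers = detect_outliers_vectorized(dataset)
print(z_outliers)
```

Because the input is converted to a float array, the flagged values come back as floats rather than the original integers.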

# 2) Interquartile Range (IQR)
The IQR is the difference between the 75th and 25th percentiles of a dataset.

Steps
1. Arrange the data in increasing order
2. Calculate the first quartile (q1) and the third quartile (q3)
3. Find the interquartile range (iqr = q3 - q1)
4. Find the lower bound: q1 - 1.5 * iqr
5. Find the upper bound: q3 + 1.5 * iqr

Anything that lies below the lower bound or above the upper bound is an outlier.

In [35]:
# step 1
sorted(dataset)

[10,
 10,
 11,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 15,
 15,
 16,
 16,
 17,
 17,
 17,
 17,
 18,
 18,
 18,
 19,
 19,
 100,
 103,
 109]

In [37]:
# step 2
q1,q3= np.percentile(dataset,[25,75])

In [38]:
print(q1,q3)

12.0 17.0


In [41]:
# step 3
iqr=q3-q1
print(iqr)

5.0


In [44]:
# step 4 and step 5
lower_bound = q1 - (1.5*iqr)
upper_bound = q3 + (1.5*iqr)
print(lower_bound,upper_bound)

4.5 24.5
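
With the bounds in hand, the remaining step is to pick out the values that fall outside them. The following sketch repeats steps 2 to 5 so it can run on its own; the variable name `iqr_outliers` is introduced here for illustration.

```python
import numpy as np

# Same sample data used in the rest of this notebook
dataset = [11, 13, 14, 16, 13, 12, 17, 18, 19, 11, 12, 14, 17, 19, 12, 103, 100,
           109, 12, 18, 17, 16, 15, 14, 13, 12, 10, 17, 15, 14, 12, 11, 18, 10]

# Steps 2 and 3: quartiles and interquartile range
q1, q3 = np.percentile(dataset, [25, 75])
iqr = q3 - q1

# Steps 4 and 5: bounds at 1.5 * IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep every value that lies outside the [lower_bound, upper_bound] interval
iqr_outliers = [x for x in dataset if x < lower_bound or x > upper_bound]
print(iqr_outliers)
```

For this dataset the IQR rule flags the same three values as the z-score method, though the two methods do not have to agree in general.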
