### What is an outlier?
An outlier is a data point that lies far from the other observations in a dataset, outside the overall distribution of the data.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### What are the criteria to identify an outlier?

1. A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
2. A data point that lies more than 3 standard deviations from the mean, that is, a point whose z-score has an absolute value greater than 3. Some analyses use a cutoff of 2 standard deviations instead.

### Why do outliers exist in a dataset?

1. Natural variability in the data
2. An experimental or measurement error

### What is the impact of outliers on a dataset?

1. They can distort the results of a statistical analysis.
2. They can significantly shift the mean and inflate the standard deviation, because both statistics depend on every value in the dataset.
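
The effect on the mean and the standard deviation can be seen on a small example. The sketch below uses hypothetical toy values (not the notebook's dataset) and compares the statistics before and after one extreme value is appended; the median is printed only as a point of comparison.

```python
import numpy as np

# Hypothetical toy data: five typical readings plus one extreme value.
clean = np.array([10, 12, 11, 13, 14])
with_outlier = np.append(clean, 100)

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        name,
        "| mean =", round(float(np.mean(values)), 2),
        "| std =", round(float(np.std(values)), 2),
        "| median =", float(np.median(values)),
    )
```

A single extreme value roughly doubles the mean and inflates the standard deviation many times over, while the median barely moves.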

### Ways of finding outliers
1. Scatter plot
2. Box plot
3. Z-score
4. Interquartile range (IQR)



In [4]:
# Sample dataset: mostly values between 10 and 19, plus three extreme values
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]
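
The first two methods in the list, the scatter plot and the box plot, are visual. A minimal sketch with matplotlib is shown below. The helper name `plot_outlier_views` is made up for this example, and the values are copied from the dataset so that the block runs on its own. The sketch relies on `boxplot` using whiskers of 1.5 times the IQR by default and returning the points beyond them under the `"fliers"` key.

```python
import matplotlib.pyplot as plt

# Same values as the notebook's dataset, repeated so the block is self-contained.
data = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13,
        12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]


def plot_outlier_views(values):
    """Draw a scatter plot and a box plot side by side.

    Returns the figure and the values that the box plot draws as fliers.
    """
    fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    # Scatter plot: value against position in the list.
    ax_scatter.scatter(range(len(values)), values)
    ax_scatter.set_xlabel("index")
    ax_scatter.set_ylabel("value")
    ax_scatter.set_title("Scatter plot")

    # Box plot: points beyond the whiskers are drawn individually as fliers.
    box = ax_box.boxplot(values)
    ax_box.set_title("Box plot")

    fliers = box["fliers"][0].get_ydata()
    return fig, fliers


fig, fliers = plot_outlier_views(data)
print("Points drawn as fliers:", sorted(float(v) for v in fliers))
# In a notebook with %matplotlib inline the figure renders automatically;
# in a plain script, call plt.show() to display it.
```

In both panels the three extreme values sit far above the cluster between 10 and 19.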

## Detecting outliers using the z-score

### Using the z-score

Formula for the z-score = (Observation - Mean) / Standard Deviation

z = (X - μ) / σ

In [11]:
def getOutliers(dataset):
    """Return the values whose absolute z-score exceeds the threshold."""
    outliers = []

    mu = np.mean(dataset)
    sigma = np.std(dataset)   # population standard deviation (ddof=0)
    print('mean of the data = ' + str(mu))
    print('standard deviation of the data = ' + str(sigma))

    threshold = 3   # flag points more than 3 standard deviations from the mean

    for x in dataset:
        z = (x - mu) / sigma
        print(z)
        if np.abs(z) > threshold:
            outliers.append(x)
    return outliers

outliersItems = getOutliers(dataset)
print('Outliers -> ' , outliersItems)

mean of the data = 21.176470588235293
standard deviation of the data = 26.37230118696876
-0.38587723217963826
-0.4237958041279264
-0.3479586602313501
-0.27212151633477377
-0.3479586602313501
-0.23420294438648565
-0.27212151633477377
-0.31004008828306195
-0.23420294438648565
3.064712815114584
-0.3479586602313501
-0.27212151633477377
-0.15836580048990934
-0.08252865659333301
3.254305674856025
-0.4237958041279264
-0.31004008828306195
-0.3479586602313501
-0.27212151633477377
-0.3479586602313501
3.292224246804313
-0.3479586602313501
-0.38587723217963826
-0.27212151633477377
-0.31004008828306195
-0.23420294438648565
-0.4237958041279264
-0.23420294438648565
-0.3479586602313501
-0.4237958041279264
-0.27212151633477377
-0.31004008828306195
-0.23420294438648565
-0.4237958041279264
Outliers ->  [102, 107, 108]
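
The same check can be written without an explicit loop. The following is a minimal vectorized sketch, not a replacement for the function above. The name `zscore_outliers` and its `threshold` and `ddof` parameters are assumptions made for this example; `ddof=0` matches the default behaviour of `np.std`, which the notebook uses.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0, ddof=0):
    """Return the values whose absolute z-score is greater than `threshold`.

    ddof=0 uses the population standard deviation (np.std's default);
    ddof=1 would use the sample standard deviation instead.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=ddof)
    return values[np.abs(z) > threshold]


# Same values as the notebook's dataset, repeated so the block is self-contained.
data = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13,
        12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

print("Outliers ->", zscore_outliers(data))
```

In the output above, the three extreme values only just clear the threshold (z between about 3.06 and 3.29). They are part of the data from which the mean and standard deviation are computed, so they pull both statistics toward themselves.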


### To calculate any percentile of the dataset: <br>
<b>index = (q / 100) * len(dataset)</b>, truncated to an integer -> this gives the index of the q-th percentile in the sorted dataset <br>
element = dataset[index]

In [18]:
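dataset.sort()       # sort in place; the index lookup below needs sorted data

n = len(dataset)
q1 = int(n * 0.25)   # index of the 25th percentile
dataset[q1]          # not displayed: a notebook cell only echoes its last expression
q3 = int(n * 0.75)   # index of the 75th percentile
dataset[q3]          # displayed as the cell output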

15
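
The index rule can be wrapped in a small helper and compared with `np.percentile`. This is a sketch under assumptions: the name `percentile_by_index` is made up for this example, and clamping the index to the last element (so that q = 100 does not run past the end) is an addition not present in the cell above. By default `np.percentile` interpolates linearly between neighbouring values, so the two approaches can disagree on small datasets.

```python
import numpy as np


def percentile_by_index(values, q):
    """Pick the element at index int(len * q / 100) of the sorted values."""
    ordered = sorted(values)
    index = min(int(len(ordered) * q / 100), len(ordered) - 1)  # clamp for q = 100
    return ordered[index]


# Same values as the notebook's dataset, repeated so the block is self-contained.
data = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13,
        12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

for q in (25, 75):
    print(q, "-> index rule:", percentile_by_index(data, q),
          "| np.percentile:", float(np.percentile(data, q)))

# A tiny case where the two disagree: the index rule returns an element
# of the data, while np.percentile interpolates between 2 and 3.
small = [1, 2, 3, 4]
print("median of", small, "-> index rule:", percentile_by_index(small, 50),
      "| np.percentile:", float(np.percentile(small, 50)))
```

On this dataset both approaches agree on the quartiles. On the four-element list the index rule returns 3, while `np.percentile` returns 2.5.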

## Interquartile Range
The interquartile range (IQR) is the difference between the 75th percentile and the 25th percentile of a dataset.<br>
Steps:
1. Sort the data in increasing order
2. Calculate the first quartile (q1) and the third quartile (q3)
3. Find the interquartile range: iqr = q3 - q1
4. Find the lower bound: q1 - 1.5*iqr
5. Find the upper bound: q3 + 1.5*iqr<br>
   Anything that lies outside the lower and upper bounds is an outlier.

In [28]:
# Step 1: sort the data (np.percentile does not require it, but it keeps the outlier list ordered)
dataset.sort()

# Step 2: first and third quartiles
q1_quartile, q3_quartile = np.percentile(dataset, [25, 75])
print('25th percentile of the dataset = ' + str(q1_quartile))
print('75th percentile of the dataset = ' + str(q3_quartile))

# Step 3: interquartile range
iqr = q3_quartile - q1_quartile

print('Inter Quartile Range = ' + str(iqr))

# Steps 4 and 5: lower and upper bounds
lower_bound = q1_quartile - iqr*1.5
upper_bound = q3_quartile + iqr*1.5

data_bound = [lower_bound, upper_bound]
print('Bound of the data ', data_bound)

# Anything outside [lower_bound, upper_bound] is an outlier
outliers = [x for x in dataset if x > upper_bound or x < lower_bound]
print('Outliers -->' , outliers)

25th percentile of the dataset = 12.0
75th percentile of the dataset = 15.0
Inter Quartile Range = 3.0
Bound of the data  [7.5, 19.5]
Outliers --> [102, 107, 108]
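
The steps above can also be packaged as a reusable function. This is a minimal sketch: the name `iqr_outliers` and the multiplier parameter `k` are assumptions made for this example, with `k=1.5` matching the rule used in this notebook.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_bound, upper_bound)) using the k * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Boolean mask: True for every value outside [lower, upper].
    mask = (values < lower) | (values > upper)
    return values[mask], (float(lower), float(upper))


# Same values as the notebook's dataset, repeated so the block is self-contained.
data = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13,
        12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

found, bounds = iqr_outliers(data)
print("Bounds   ->", bounds)
print("Outliers ->", found)
```

A larger `k` widens the bounds and flags fewer points. Unlike the z-score, the quartiles are barely affected by the extreme values themselves, which is why the bounds here stay close to the bulk of the data.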
