## How to find outliers in a dataset?
- An outlier is a data point that lies far away from the other observations in the dataset.

### What are the criteria to identify an outlier?
- A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
- A data point that lies more than 3 standard deviations from the mean, i.e. its z-score has an absolute value greater than 3. Some analyses use a lower cutoff such as 2, which flags more points.
### Why do outliers exist in a dataset?
- Variability in the data
- An experimental measurement error
### What are the impacts of having outliers in a dataset?
- Outliers can distort statistical analysis, because many common estimates are sensitive to extreme values.
- In particular, they can shift the mean and inflate the standard deviation significantly.
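
As a rough illustration of that impact, the sketch below compares summary statistics with and without three extreme values, using the same dataset as the examples further down. The cutoff of 100 used to drop the extreme values is an assumption made only for this illustration, not a detection rule, and `summarize` is a hypothetical helper name.

```python
import numpy as np

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12,
           14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

# Illustrative assumption: treat anything >= 100 as an extreme value.
without_extremes = [x for x in dataset if x < 100]


def summarize(values):
    """Return (mean, population standard deviation, median) of a sequence."""
    return float(np.mean(values)), float(np.std(values)), float(np.median(values))


mean_all, sd_all, median_all = summarize(dataset)
mean_clean, sd_clean, median_clean = summarize(without_extremes)

print(f"with extremes:    mean={mean_all:.2f}  sd={sd_all:.2f}  median={median_all:.1f}")
print(f"without extremes: mean={mean_clean:.2f}  sd={sd_clean:.2f}  median={median_clean:.1f}")
```

Here the mean drops from about 21.2 to 13.0 and the standard deviation from about 26.4 to about 2.1 once the three extreme values are removed, while the median is 13.0 in both cases.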

### Various ways of finding outliers
- Using a scatter plot
- Using a box plot
- Using the Z-score
- Using the interquartile range (IQR)
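
The two visual approaches are not demonstrated elsewhere in this notebook, so here is a minimal sketch of both with matplotlib, using the same dataset as the cells below. The figure layout and variable names (`ax_scatter`, `ax_box`, `flier_values`) are choices made for this example.

```python
import matplotlib.pyplot as plt

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12,
           14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: position in the list on the x-axis, value on the y-axis.
ax_scatter.scatter(range(len(dataset)), dataset)
ax_scatter.set_xlabel("index")
ax_scatter.set_ylabel("value")
ax_scatter.set_title("Scatter plot")

# Box plot: with the default whis=1.5, points more than 1.5 * IQR beyond
# the box are drawn individually as "fliers".
box = ax_box.boxplot(dataset)
ax_box.set_title("Box plot")

# Read back the values that matplotlib drew as fliers.
flier_values = sorted(float(v) for v in box["fliers"][0].get_ydata())
print(flier_values)

# In a notebook with %matplotlib inline the figure renders automatically;
# in a plain script, call plt.show() to display it.
```

In the scatter plot the extreme values appear as isolated points far above the main band, and in the box plot they appear as separate markers beyond the whiskers.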

In [4]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
dataset= [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107, 10,13,12,14,12,108,12,11,14,13,15,10,15,12,10,14,13,15,10]

#### Detecting outliers using the Z-score

- Formula for the Z-score = (Observation - Mean) / Standard Deviation

- z = (X - μ) / σ

In [13]:
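def detect_outlier(data, threshold=3):
    """Return the values whose absolute z-score exceeds `threshold`."""
    outliers = []  # local list, so repeated calls do not accumulate results
    mean = np.mean(data)
    sd = np.std(data)  # population standard deviation (ddof=0)

    for value in data:
        z_score = (value - mean) / sd
        # abs() makes the check symmetric: far above or far below the mean
        if np.abs(z_score) > threshold:
            outliers.append(value)
    return outliers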

In [14]:
outlier_pt=detect_outlier(dataset)

In [15]:
outlier_pt

[102, 107, 108]
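
The same check can also be written without an explicit loop. The following is only a sketch of that alternative, assuming NumPy array operations are acceptable here; the function name `zscore_outliers` is hypothetical and not part of the original notebook.

```python
import numpy as np

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12,
           14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]


def zscore_outliers(data, threshold=3.0):
    """Return values whose absolute z-score exceeds `threshold`, in input order."""
    values = np.asarray(data, dtype=float)
    # np.ndarray.std() defaults to the population standard deviation (ddof=0),
    # matching np.std(data) in the loop-based version.
    z_scores = (values - values.mean()) / values.std()
    return values[np.abs(z_scores) > threshold].tolist()


print(zscore_outliers(dataset))
```

Because the array is converted to floats, the result contains `102.0`, `107.0` and `108.0` rather than integers. Note that the extreme values themselves raise the mean and standard deviation used in the z-score, so on small samples a lower threshold may be needed to flag them.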

#### Interquartile Range
- The IQR is the difference between the 75th percentile (third quartile) and the 25th percentile (first quartile) of a dataset.

##### Steps
1. Arrange the data in increasing order.
2. Calculate the first quartile (q1) and the third quartile (q3).
3. Find the interquartile range (iqr = q3 - q1).
4. Find the lower bound: q1 - (1.5 * iqr).
5. Find the upper bound: q3 + (1.5 * iqr).

- Anything that lies outside the lower and upper bounds is an outlier.
- The multiplier 1.5 is a widely used convention; the most suitable value depends on the distribution of the data.

In [16]:
# Perform all steps of IQR
sorted(dataset)

[10,
 10,
 10,
 10,
 10,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 15,
 17,
 19,
 102,
 107,
 108]

In [17]:
quantile1,quantile3=np.percentile(dataset,[25,75])

In [18]:
print(quantile1,quantile3)

12.0 15.0


In [19]:
iqr_value=quantile3-quantile1
iqr_value

3.0

In [20]:
lower_bound_value=quantile1-(1.5*iqr_value)
upper_bound_value=quantile3+(1.5*iqr_value)

In [21]:
print(lower_bound_value,upper_bound_value)

7.5 19.5
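
The remaining step is to compare each value against the two bounds. A minimal sketch of that step follows; it recomputes the bounds so that it runs on its own, and the name `iqr_outliers` is introduced here for illustration.

```python
import numpy as np

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12,
           14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

# Same quantities as in the cells above.
quantile1, quantile3 = np.percentile(dataset, [25, 75])
iqr_value = quantile3 - quantile1
lower_bound_value = quantile1 - (1.5 * iqr_value)
upper_bound_value = quantile3 + (1.5 * iqr_value)

# Keep every value that falls below the lower bound or above the upper bound.
iqr_outliers = [x for x in dataset if x < lower_bound_value or x > upper_bound_value]
print(iqr_outliers)
```

On this dataset the IQR rule flags 102, 107 and 108, the same three values returned by the Z-score check.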
