### What is an Outlier?

An outlier is a data point that lies far from the other observations in a dataset, that is, outside the overall
distribution of the data.

### What are the criteria for identifying an outlier?

1. A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
2. A data point that lies more than 3 standard deviations from the mean. We can check this with a z-score: the point is flagged when its absolute z-score exceeds the chosen threshold (3 in this notebook; a lower cutoff such as 2 flags more points).

### Why do outliers exist in a dataset?

1. Variability in the data.
2. An experimental measurement error.

### What is the impact of outliers in a dataset?

1. They can distort the results of our statistical analysis.
2. They can have a significant effect on the mean and the standard deviation.
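
To make the second point concrete, here is a minimal sketch that compares summary statistics with and without a single extreme value. The arrays `clean` and `with_outlier` are hypothetical samples made up for illustration; they are not the dataset used later in this notebook.

```python
import numpy as np

# Hypothetical sample: seven typical readings.
clean = np.array([10, 12, 11, 13, 12, 14, 12])
# The same readings plus one extreme value.
with_outlier = np.append(clean, 100)

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{name:>12}: mean={np.mean(values):.2f} "
        f"std={np.std(values):.2f} median={np.median(values):.2f}"
    )
```

In this sample the single extreme value moves the mean from 12 to 23 and inflates the standard deviation many times over, while the median stays at 12.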

### What are the ways of finding outliers?

1. Scatter plots.
2. Box plots.
3. Z-scores.
4. The IQR (interquartile range).
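
The box plot option can be sketched with matplotlib as shown here. This is only an illustration under stated assumptions: `sample` is a hypothetical list made up for this example, and the plot relies on matplotlib's default whisker setting (`whis=1.5`), which draws points beyond 1.5 times the IQR from the box as individual "flier" markers.

```python
import matplotlib.pyplot as plt

# Hypothetical sample: values between 10 and 15 plus one extreme value.
sample = [10, 12, 11, 13, 12, 14, 15, 11, 13, 95]

fig, ax = plt.subplots(figsize=(3, 4))
# whis=1.5 is matplotlib's default: whiskers stop at the most extreme points
# within 1.5 * IQR of the box, and anything beyond is drawn as a flier.
artists = ax.boxplot(sample, whis=1.5)
ax.set_ylabel("value")

# The flier markers hold the points that the plot flags as outliers.
flagged = [float(y) for y in artists["fliers"][0].get_ydata()]
print(flagged)
plt.show()
```

Reading the flagged points back from the `fliers` artists is a convenience for checking the plot; for analysis it is usually clearer to compute the bounds directly.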

## Let's go through an example

In [4]:
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

### Detecting outliers using the z-score

### Z-score
Formula for the z-score = (Observation - Mean) / Standard Deviation

z = (X - μ) / σ

In [5]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
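def detect_outliers(data, threshold=3):
    """Return the points whose absolute z-score exceeds the threshold."""
    # Build the list inside the function so repeated calls do not accumulate results.
    outliers = []
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers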

In [7]:
outlier_pt = detect_outliers(dataset)
outlier_pt

[102, 107, 108]
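
The same z-score check can also be written without an explicit loop. The sketch below is one possible vectorized variant, not a replacement for the function above; the name `detect_outliers_vectorized` and its `threshold` parameter are assumptions introduced here for illustration.

```python
import numpy as np

def detect_outliers_vectorized(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    values = np.asarray(data, dtype=float)
    # ndarray.std defaults to the population standard deviation (ddof=0),
    # the same convention as np.std in the loop-based version.
    z_scores = (values - values.mean()) / values.std()
    # Boolean mask keeps only the entries beyond the threshold.
    return values[np.abs(z_scores) > threshold]

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

print(detect_outliers_vectorized(dataset).tolist())  # [102.0, 107.0, 108.0]
```

Note that this sketch does not guard against a standard deviation of zero (all values identical), which would need separate handling.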

### Interquartile range

The IQR is the difference between the 75th and 25th percentiles of a dataset.

#### Steps
1. Arrange the data in increasing order.
2. Calculate the first quartile (q1) and the third quartile (q3).
3. Find the interquartile range: iqr = q3 - q1.
4. Find the lower bound: q1 - 1.5 * iqr.
5. Find the upper bound: q3 + 1.5 * iqr.

Any value that lies below the lower bound or above the upper bound is an outlier.

In [8]:
sort_data = sorted(dataset)
sort_data

[10,
 10,
 10,
 10,
 10,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 15,
 17,
 19,
 102,
 107,
 108]

In [9]:
Quartile1, Quartile3 = np.percentile(dataset,[25,75])

In [10]:
print(Quartile1, Quartile3)

12.0 15.0


In [11]:
## Find the IQR
iqr_value = Quartile3-Quartile1
iqr_value

3.0

In [12]:
## Find the lower bound value and the upper bound value
## Use the computed IQR rather than a hard-coded 3.0
lower_bound_val = Quartile1 - (1.5 * iqr_value)
upper_bound_val = Quartile3 + (1.5 * iqr_value)

In [13]:
print(lower_bound_val, upper_bound_val)

7.5 19.5


In [33]:
## Compare against the computed bounds instead of rounded constants
for i in sort_data:
    if i > upper_bound_val or i < lower_bound_val:
        print(i)

102
107
108
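
The IQR steps can be collected into one reusable helper. This is a minimal sketch under stated assumptions: the function name `iqr_outliers` and the multiplier parameter `k` are hypothetical names introduced here, and quartiles come from `np.percentile` with its default linear interpolation, as in the cells above.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    values = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep only the values strictly outside the [lower, upper] interval.
    mask = (values < lower) | (values > upper)
    return float(lower), float(upper), values[mask]

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

lower, upper, found = iqr_outliers(dataset)
print(lower, upper, found.tolist())  # 7.5 19.5 [102.0, 107.0, 108.0]
```

Sorting is not needed here because `np.percentile` handles ordering internally; the outliers come back in their original order.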
