# Outliers

### What is an outlier?

* In statistics, an outlier is a data point that differs significantly from other observations. Outliers can be caused by measurement errors, data entry errors, or simply by the presence of extreme values in the data. 


### What are the criteria to identify an outlier?

* A data point that falls <b>more than 1.5 times the interquartile range (IQR)</b> above the 3rd quartile or below the 1st quartile.
* A data point that falls <b>more than 3 standard deviations from the mean</b>. Equivalently, its z score has an absolute value greater than 3. Some analyses use a lower cutoff of 2, which flags more points.

### What are the reasons for an outlier to exist in a dataset?

* Variability in the data
* An experimental measurement error

### What are the impacts of having outliers in a dataset?

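* They can distort the results of a statistical analysis, because a few extreme values can dominate a summary of the whole dataset
* They can strongly shift the mean and inflate the standard deviation, since both depend on every value; the median and the IQR are much less sensitive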
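
To make the effect on the mean and standard deviation concrete, here is a minimal sketch that compares summary statistics with and without the three extreme values. It reuses the sample values from this notebook; the names `sample` and `trimmed` and the cutoff of 100 used to drop the extremes are assumptions chosen only for illustration.

```python
import numpy as np

# Same values as the notebook's sample data
sample = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
          10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
          14, 13, 15, 10]

# Illustrative assumption: treat everything at or above 100 as "extreme"
trimmed = [x for x in sample if x < 100]

for label, values in (("with outliers", sample), ("without outliers", trimmed)):
    print(label,
          "mean=%.2f" % np.mean(values),
          "std=%.2f" % np.std(values),
          "median=%.1f" % np.median(values))
```

With the extreme values included, the mean is about 21.2 and the standard deviation about 26.4; without them the mean is 13.0 and the standard deviation about 2.1. The median is 13.0 in both cases.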

### Various ways of finding outliers
* Using scatter plots
* Using box plots
* Using the z score
* Using the interquartile range (IQR)
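
As a rough sketch of the box plot approach, the snippet below draws a box plot of the notebook's sample values with matplotlib and reads back the points drawn as fliers. It assumes matplotlib's default whisker length of 1.5 times the IQR (passed explicitly as `whis=1.5`); the names `parts` and `flier_values` are illustrative.

```python
import matplotlib.pyplot as plt

# Same values as the notebook's sample data
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

fig, ax = plt.subplots()
# whis=1.5 puts the whiskers at the most extreme points within 1.5 * IQR of the box
parts = ax.boxplot(dataset, whis=1.5)
ax.set_ylabel("value")

# Points beyond the whiskers are drawn as individual markers ("fliers")
flier_values = sorted(float(v) for v in parts["fliers"][0].get_ydata())
print(flier_values)
```

The points drawn individually beyond the whiskers are the ones the 1.5 times IQR rule treats as outliers.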


### Notes
* Identifying unusual data points - Z-scores can be used to identify data points that are unusual or outliers. Data points with Z-scores greater than 3 or less than -3 are typically considered to be outliers. These data points may be due to errors in measurement or may represent extreme values within the population.


In [1]:
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

## Detecting outliers using the Z score

### Using Z score

Formula for Z score = (Observation - Mean) / Standard Deviation

z = (X - μ) / σ

In [2]:
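import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


def detect_outliers(data):
    # Keep the result list local so repeated calls do not accumulate values
    outliers = []
    threshold = 3  # flag points more than 3 standard deviations from the mean
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers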

In [3]:
outlier_pt=detect_outliers(dataset)

In [4]:
outlier_pt

[102, 107, 108]
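
The same check can also be written without an explicit loop. The following is a sketch of a vectorized variant, not a replacement for the function above: the name `detect_outliers_zscore` and the `threshold` parameter are hypothetical, and it uses the population standard deviation (`ddof=0`), as `np.std` does by default.

```python
import numpy as np

def detect_outliers_zscore(data, threshold=3.0):
    """Return the values whose absolute z score exceeds `threshold`."""
    values = np.asarray(data, dtype=float)
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All values are identical, so no point can be an outlier
        return []
    z_scores = (values - values.mean()) / std
    return values[np.abs(z_scores) > threshold].tolist()

# Same values as the notebook's sample data
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

print(detect_outliers_zscore(dataset))
```

Because the input is converted to a float array, the flagged values come back as floats rather than the original integers.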

## Interquartile Range

The interquartile range (IQR) is the difference between the 75th percentile (q3) and the 25th percentile (q1) of a dataset.

<b> Steps</b>
* Arrange the data in increasing order
* Calculate the first quartile (q1) and the third quartile (q3)
* Find the interquartile range: IQR = q3 - q1
* Find the lower bound: q1 - 1.5 × IQR
* Find the upper bound: q3 + 1.5 × IQR

Anything that lies below the lower bound or above the upper bound is an outlier

In [5]:
## Perform all the steps of IQR
sorted(dataset)

[10,
 10,
 10,
 10,
 10,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 15,
 17,
 19,
 102,
 107,
 108]

In [6]:
quantile1, quantile3= np.percentile(dataset,[25,75])

In [7]:
print(quantile1,quantile3)

12.0 15.0


In [8]:
## Find the IQR

iqr_value=quantile3-quantile1
print(iqr_value)

3.0


In [9]:
## Find the lower bound value and the higher bound value

lower_bound_val = quantile1 -(1.5 * iqr_value) 
upper_bound_val = quantile3 +(1.5 * iqr_value) 

In [10]:
print(lower_bound_val,upper_bound_val)

7.5 19.5
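
The last step, keeping only the points that fall outside the two bounds, could be sketched as follows. The function name `detect_outliers_iqr` and the multiplier parameter `k` are hypothetical; the quartiles come from `np.percentile` with its default interpolation, as in the cells above.

```python
import numpy as np

def detect_outliers_iqr(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Keep the original order of the flagged points
    return [x for x in data if x < lower_bound or x > upper_bound]

# Same values as the notebook's sample data
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

print(detect_outliers_iqr(dataset))
```

With bounds of 7.5 and 19.5, the value 19 stays inside the range, so only 102, 107 and 108 are flagged. These are the same points the z score method found.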
