### Outliers
* An outlier is a data point that lies far from the other observations in a data set.
* Equivalently, it is a data point that falls outside the overall distribution of the dataset.

#### Criteria to Identify an Outlier
1. Interquartile Range (IQR) - A data point that falls more than 1.5 times the IQR below the 1st quartile or above the 3rd quartile.
2. Z score - A data point that lies more than 3 standard deviations from the mean, i.e. whose Z score has an absolute value greater than 3.

#### Reasons for Outliers to Exist in a Dataset
1. Variability in the data
2. Human error: an experimental measurement error

#### Impact of Outliers in a Dataset
1. May cause problems in statistical analysis
2. May have a significant impact on the mean and standard deviation
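The effect on the mean and standard deviation can be seen with a small sketch. The `sample` values here are made up for illustration (they are not the dataset used later), and 102 plays the role of the outlier. The median is printed alongside as a statistic that is much less sensitive to a single extreme value.

```python
import numpy as np

# Illustrative values only; 102 is the deliberately extreme point.
sample = [10, 12, 12, 13, 14, 15, 102]
trimmed = [x for x in sample if x != 102]

for label, values in [("with outlier", sample), ("without outlier", trimmed)]:
    print(label,
          "mean=%.2f" % np.mean(values),
          "std=%.2f" % np.std(values),
          "median=%.2f" % np.median(values))
```

With the extreme value included, the mean roughly doubles and the standard deviation grows many times over, while the median shifts by only half a unit.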

#### Various Ways to Find Outliers
1. Scatter plot
2. Box plot
3. Z score
4. Interquartile Range (IQR)
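As a visual check, a box plot can be sketched with matplotlib. This is a minimal example, not the only way to do it: the `data` list repeats the values of the `dataset` defined in the cells below, and the figure size and labels are arbitrary choices. By default `boxplot` draws whiskers at 1.5 times the IQR and marks anything beyond them as individual "flier" points.

```python
import matplotlib.pyplot as plt

# Same values as the notebook's `dataset`, repeated so the cell runs on its own.
data = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
        10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
        14, 13, 15, 10]

fig, ax = plt.subplots(figsize=(4, 5))
# Default whiskers reach 1.5 * IQR; points beyond them are drawn as fliers.
parts = ax.boxplot(data)
ax.set_ylabel("value")
ax.set_title("Box plot of the sample data")

# Read back the points that matplotlib drew as fliers.
flier_values = sorted(float(v) for v in parts["fliers"][0].get_ydata())
print(flier_values)
plt.show()
```

The fliers read back from the plot should be the same large values that the Z score and IQR functions report further down.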

In [1]:
"\u03BC"

'μ'

In [2]:
"\u03C3"

'σ'

In [3]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
dataset= [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107, 10,13,12,14,12,108,12,11,14,13,15,10,15,12,10,14,13,15,10]


### 1. Detecting Outliers Using Z Score
* Z score formula: z = (X - μ) / σ, where μ is the mean and σ is the standard deviation of the data

In [5]:
# Function to find outliers using the Z score
def detect_outliers(data):
    outliers = []
    threshold = 3  # flag points more than 3 standard deviations from the mean
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers

In [6]:
detect_outliers(dataset)

[102, 107, 108]
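The same check can also be written without an explicit loop. The sketch below is one possible vectorized variant; `detect_outliers_vectorized` and its `threshold` parameter are hypothetical names introduced here, and the data is repeated so the cell runs on its own. It assumes the standard deviation is non-zero.

```python
import numpy as np

def detect_outliers_vectorized(data, threshold=3.0):
    """Return the values whose absolute Z score exceeds `threshold`."""
    values = np.asarray(data, dtype=float)
    # Population standard deviation (ddof=0), the default for ndarray.std().
    z_scores = (values - values.mean()) / values.std()
    return values[np.abs(z_scores) > threshold].tolist()

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]
print(detect_outliers_vectorized(dataset))
```

Because the values are converted to a float array, the result comes back as floats rather than the original integers.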

### 2. Interquartile Range (IQR)
* The IQR is the difference between the 75th percentile and the 25th percentile of a dataset
#### Steps
1. Arrange the data in increasing order
2. Calculate the first quartile (q1) and the third quartile (q3)
3. Find the interquartile range: IQR = q3 - q1
4. Find the lower bound = q1 - 1.5*IQR
5. Find the upper bound = q3 + 1.5*IQR
- Anything that lies below the lower bound or above the upper bound is an outlier

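#### Why Multiply by 1.5?
- The Z score rule flags points more than 3 standard deviations from the mean, on either side.
- For normally distributed data the IQR is about 1.35 standard deviations, so the bounds q1 - 1.5*IQR and q3 + 1.5*IQR sit about 2.7 standard deviations from the mean, close to the Z score cutoff of 3.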
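One way to see where the factor 1.5 leads is to compute the IQR bounds for a standard normal distribution and compare them with the Z score cutoff of 3. The sketch below uses `statistics.NormalDist` from the standard library; the variable names are chosen for this example only, and the comparison assumes the data is roughly normal.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles of the standard normal, in units of standard deviations.
z_q1 = std_normal.inv_cdf(0.25)
z_q3 = std_normal.inv_cdf(0.75)
z_iqr = z_q3 - z_q1

# IQR bounds expressed as Z scores.
lower_fence = z_q1 - 1.5 * z_iqr
upper_fence = z_q3 + 1.5 * z_iqr
print(round(z_iqr, 3), round(lower_fence, 3), round(upper_fence, 3))

# Share of a normal population that falls outside the two bounds.
outside = 2 * (1 - std_normal.cdf(upper_fence))
print(round(outside, 4))
```

The bounds come out at roughly 2.7 standard deviations on each side, so for normal data the 1.5*IQR rule is slightly stricter than the 3 standard deviation rule.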

In [7]:
# Step 1: arrange the data in increasing order
sorted(dataset)

[10,
 10,
 10,
 10,
 10,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 15,
 17,
 19,
 102,
 107,
 108]

In [8]:
q1, q3 = np.percentile(dataset,[25,75])
print(q1, q3)

12.0 15.0


In [9]:
# Find the IQR
iqr = q3 - q1
print(iqr)

3.0


In [10]:
# Find the lower bound value and upper bound value
lower_bound_val = q1 - (1.5 * iqr)
upper_bound_val = q3 + (1.5 * iqr)
print(lower_bound_val,upper_bound_val)

7.5 19.5


In [11]:
# Function to find outliers using the IQR
def OutliersIQR(data):
    outliers = []
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)

    for i in data:
        if i < lower_bound or i > upper_bound:
            outliers.append(i)
    return outliers

In [12]:
OutliersIQR(dataset)

[102, 107, 108]

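#### Practice Datasets

- `np.percentile(dataset, [25, 75])` does not need sorted input: it orders the values internally and returns the 25th percentile (Q1) and 75th percentile (Q3), interpolating linearly between neighbouring values by default.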
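A quick check on a tiny, made-up list: the order of the input does not matter, and when a percentile falls between two values `np.percentile` interpolates linearly by default. The names `small`, `p25` and `p75` are used only in this example.

```python
import numpy as np

small = [4, 1, 3, 2]  # deliberately unsorted, illustrative values
p25, p75 = np.percentile(small, [25, 75])
print(p25, p75)  # 1.75 3.25

# Sorting first gives the same quartiles.
p25_sorted, p75_sorted = np.percentile(sorted(small), [25, 75])
print(p25 == p25_sorted, p75 == p75_sorted)
```

Here the 25th percentile sits three quarters of the way from 1 to 2, which is why it is 1.75 rather than one of the original values.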

In [13]:
dataset2 = [7,7,31,31,47,75,87,115,116,119,119,155,177]
print(detect_outliers(dataset2))
print(OutliersIQR(dataset2))

[]
[]


In [14]:
q1, q3 = np.percentile(dataset2,[25,75])
print(q1, q3)

31.0 119.0


In [16]:
dataset2_unsorted = [115,116,119,31,47,119,155,177,7,7,31,75,87]
print(detect_outliers(dataset2_unsorted))
print(OutliersIQR(dataset2_unsorted))

[]
[]


In [18]:
q1, q3 = np.percentile(dataset2_unsorted,[25,75])
print(q1, q3)

31.0 119.0
