## Identifying Outliers Using Different Techniques

Whether to remove outliers depends on the problem statement.

**Machine Learning Algorithms that are sensitive to outliers:**
1. Linear Regression
2. Logistic Regression
3. KMeans Clustering
4. Hierarchical Clustering
5. PCA
6. Neural Networks
7. LDA
8. DBSCAN

**Machine Learning Algorithms that are not sensitive to outliers:**
1. Naive Bayes
2. SVM
3. Decision Tree
4. Ensemble Learning (RF, XGBoost, GB)
5. K Nearest Neighbors (KNN)
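
To make "sensitive to outliers" concrete, here is a minimal sketch using a least-squares line fit, the core of linear regression. The toy data and variable names are hypothetical and chosen only for illustration: a single corrupted point is enough to pull the fitted slope well away from the true value.

```python
import numpy as np

# Hypothetical toy data: y = 2x exactly, for x = 0..9.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x

# Same data with one corrupted observation at the last point.
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

# Ordinary least-squares line fit (degree-1 polynomial).
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with one outlier: {slope_outlier:.2f}")
```

Because least squares minimizes squared errors, one extreme value contributes disproportionately to the loss, which is why the fitted slope shifts so much.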

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [9]:
dataset = [14, 28, 23, 18, 14, 20, 21, 14, 17, 23, 108, 28, 17, 24, 25, 20, 29, 24, 17, 15, 18, 20, 107, 18, 13, 20, 26, 25, 16, 23, 17, 19, 18, 12, 24, 120, 29, 14, 13, 29, 28, 12, 17]

### Z-score
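The Z-score measures how many standard deviations a data point lies from the mean. It is calculated as the difference between the data point and the mean, divided by the standard deviation. The larger the absolute Z-score, the farther the data point is from the mean. A common rule of thumb, used here, flags points with an absolute Z-score above 3 as outliers.<br>
Z = (x - mean) / standard deviation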

In [10]:
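def detect_outlier(data, threshold=3):
    """Return the values in data whose absolute Z-score exceeds threshold."""
    outliers = []  # local list, so repeated calls do not accumulate results
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)
    for y in data:
        z_score = (y - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(y)
    return outliers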

In [11]:
outlier_pts = detect_outlier(dataset)

In [12]:
outlier_pts

[108, 107, 120]

The function `detect_outlier` flags 108, 107, and 120 as outliers.
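
The same Z-score rule can also be written without an explicit loop. The following is a minimal sketch, assuming the data fits in a NumPy array; the function name `zscore_outliers` is hypothetical and not part of the original notebook.

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Return the values whose absolute Z-score exceeds threshold."""
    values = np.asarray(data, dtype=float)
    # Z = (x - mean) / standard deviation, computed for all points at once.
    z_scores = (values - values.mean()) / values.std()
    # Boolean mask selects only the points beyond the threshold.
    return values[np.abs(z_scores) > threshold]

dataset = [14, 28, 23, 18, 14, 20, 21, 14, 17, 23, 108, 28, 17, 24, 25, 20,
           29, 24, 17, 15, 18, 20, 107, 18, 13, 20, 26, 25, 16, 23, 17, 19,
           18, 12, 24, 120, 29, 14, 13, 29, 28, 12, 17]

print(zscore_outliers(dataset))  # flags 108, 107 and 120
```

Note that the mean and standard deviation are themselves inflated by the outliers, so on small samples with very extreme values the Z-score rule can miss points that the IQR rule would catch.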

### IQR
The interquartile range (IQR) is a measure of statistical dispersion. It is the difference between the 75th and 25th percentiles of a data set. The larger the IQR, the more spread out the middle 50% of the data.<br>
**Steps**
1. Arrange the data in ascending order
2. Calculate the first quartile (`q1`) and the third quartile (`q3`)
3. Calculate the interquartile range: `IQR = q3 - q1`
4. Calculate the lower and upper bounds for outliers: `q1 - 1.5 * IQR` and `q3 + 1.5 * IQR`


Any value outside these bounds is treated as an outlier.

In [18]:
sorted(dataset)

[12,
 12,
 13,
 13,
 14,
 14,
 14,
 14,
 15,
 16,
 17,
 17,
 17,
 17,
 17,
 18,
 18,
 18,
 18,
 19,
 20,
 20,
 20,
 20,
 21,
 23,
 23,
 23,
 24,
 24,
 24,
 25,
 25,
 26,
 28,
 28,
 28,
 29,
 29,
 29,
 107,
 108,
 120]

In [14]:
quartile_1, quartile_3 = np.percentile(dataset, [25, 75])

In [15]:
print(quartile_1,quartile_3)

17.0 25.0


In [16]:
# Find the IQR
IQR = quartile_3 - quartile_1
print(IQR)

8.0


In [17]:
# Find the lower and upper bounds
lower_bound = quartile_1 - (1.5 * IQR)
upper_bound = quartile_3 + (1.5 * IQR)
print(lower_bound,upper_bound)

5.0 37.0
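
With the bounds in hand, the remaining step is to select the values that fall outside them. The following is a minimal, self-contained sketch of that step. It recomputes the quartiles so it can run on its own, and the names `values`, `mask`, and `iqr_outliers` are illustrative rather than taken from the notebook.

```python
import numpy as np

dataset = [14, 28, 23, 18, 14, 20, 21, 14, 17, 23, 108, 28, 17, 24, 25, 20,
           29, 24, 17, 15, 18, 20, 107, 18, 13, 20, 26, 25, 16, 23, 17, 19,
           18, 12, 24, 120, 29, 14, 13, 29, 28, 12, 17]

values = np.asarray(dataset)

# Quartiles, IQR, and the 1.5 * IQR bounds, as computed step by step above.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# A point is an outlier if it lies below the lower or above the upper bound.
mask = (values < lower_bound) | (values > upper_bound)
iqr_outliers = values[mask]

print(iqr_outliers)  # flags 108, 107 and 120
```

On this dataset the IQR rule flags the same three points as the Z-score rule, since every other value lies between 5.0 and 37.0.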
