## Dealing with Outliers

### Identification of Outliers

#### BOX PLOT

##### Elements of a Box Plot

A box plot gives a five-number summary of a set of data:

- Minimum: the smallest value in the dataset, excluding outliers.
- First Quartile (Q1): 25% of the data lies below the first (lower) quartile.
- Median (Q2): the mid-point of the dataset; half of the values lie below it and half above.
- Third Quartile (Q3): 75% of the data lies below the third (upper) quartile.
- Maximum: the largest value in the dataset, excluding outliers.

![image.png](attachment:image.png)

The box spans the middle 50% of the data, and its length is known as the Interquartile Range (IQR). The IQR is calculated as:

IQR = Q3-Q1

Outliers are the data points that fall below the lower limit or above the upper limit. The two limits are calculated as:

Lower Limit = Q1 - 1.5*IQR

Upper Limit = Q3 + 1.5*IQR

##### How to Create a Box Plot

Let us take some sample data to understand how to create a box plot.

Here are the runs scored by a cricket team in a league of 12 matches – 100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110.

To draw a box plot for the given data, we first arrange the data in ascending order and then find the minimum, first quartile, median, third quartile, and maximum.

Ascending order:
100, 110, 110, 110, 120, 120, 130, 140, 140, 150, 170, 220

Median (Q2) = (120+130)/2 = 125, since the number of values is even and the median is the mean of the two central values.

To find the First Quartile, we take the first six values (100, 110, 110, 110, 120, 120) and find their median.

Q1 = (110+110)/2 = 110

For the Third Quartile, we take the last six values (130, 140, 140, 150, 170, 220) and find their median.

Q3 = (140+150)/2 = 145

Note: If the total number of values is odd, we exclude the Median while calculating Q1 and Q3. Here the number of values is even, so the two central values (120 and 130) are included, one in each half.

Now, we need to calculate the Interquartile Range.

IQR = Q3-Q1 = 145-110 = 35

We can now calculate the Upper and Lower Limits to find the minimum and maximum values and also the outliers if any.

Lower Limit = Q1-1.5*IQR = 110-1.5*35 = 57.5

Upper Limit = Q3+1.5*IQR = 145+1.5*35 = 197.5


So, the minimum and maximum values within the range [57.5, 197.5] for our given data are:

Minimum = 100

Maximum = 170

The only value outside this range is the outlier:

Outlier = 220

![image.png](attachment:image.png)
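
The worked example above can be reproduced with a short Python sketch. It assumes the median-of-halves convention for quartiles used in the text; library functions such as `numpy.percentile` interpolate by default and may return slightly different quartiles. The function name `five_number_summary` is illustrative, not part of any library.

```python
from statistics import median

# Runs scored by the cricket team in the 12 matches (from the example above)
runs = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]


def five_number_summary(values):
    """Box plot summary using the median-of-halves rule for Q1 and Q3."""
    data = sorted(values)
    n = len(data)
    half = n // 2

    # For an odd count the median itself is excluded from both halves
    lower_half = data[:half]
    upper_half = data[half:] if n % 2 == 0 else data[half + 1:]

    q1, q2, q3 = median(lower_half), median(data), median(upper_half)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr

    # Values inside the limits define the whiskers; the rest are outliers
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]

    return {
        "minimum": min(inside),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "maximum": max(inside),
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "outliers": outliers,
    }


summary = five_number_summary(runs)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The printed values should match the hand calculation: Q1 = 110, median = 125, Q3 = 145, limits of 57.5 and 197.5, and 220 as the only outlier.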

a) If the Median is at the center of the Box and the whiskers are almost the
   same length on both ends, then the data is roughly symmetric (as Normally Distributed data would be).
b) If the Median lies closer to the First Quartile and the whisker at the lower
   end is shorter (as in the above example), then the data has a Positive Skew (Right Skew).
c) If the Median lies closer to the Third Quartile and the whisker at the
   upper end is shorter, then the data has a Negative Skew (Left Skew).

![image-2.png](attachment:image-2.png)
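
The three rules above can be turned into a rough heuristic. The sketch below is only illustrative: the function names and the `tolerance` value (how close two lengths must be to count as "almost the same") are assumptions, and the symmetric case is reported as "roughly symmetric" because a box plot shows symmetry, not the exact shape of the distribution.

```python
def roughly_equal(a, b, tolerance):
    """True if a and b differ by at most `tolerance` times the larger one."""
    largest = max(abs(a), abs(b))
    return largest == 0 or abs(a - b) <= tolerance * largest


def box_plot_skew(minimum, q1, q2, q3, maximum, tolerance=0.1):
    """Classify skew from the five-number summary (heuristic only)."""
    lower_box, upper_box = q2 - q1, q3 - q2
    lower_whisker, upper_whisker = q1 - minimum, maximum - q3

    # Rule a): median centered and whiskers of similar length
    if (roughly_equal(lower_box, upper_box, tolerance)
            and roughly_equal(lower_whisker, upper_whisker, tolerance)):
        return "roughly symmetric"
    # Rule b): median closer to Q1 and shorter lower whisker
    if lower_box < upper_box and lower_whisker < upper_whisker:
        return "positive (right) skew"
    # Rule c): median closer to Q3 and shorter upper whisker
    if lower_box > upper_box and lower_whisker > upper_whisker:
        return "negative (left) skew"
    # The box and the whiskers disagree, so the rules do not apply cleanly
    return "inconclusive"


# Five-number summary of the cricket example
print(box_plot_skew(100, 110, 125, 145, 170))
```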



#### Z-Score Formula

To calculate the z-score of a data point, we need its value along with the mean and standard deviation of the dataset. The z-score measures how many standard deviations a value lies from the mean, and it is calculated using the following formula.

z = (X - mean)/standard deviation 
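
As a minimal sketch, the formula can be applied to the cricket scores from the box plot example. Two choices here are assumptions rather than part of the text above: the population standard deviation (`statistics.pstdev`) is used, and values with an absolute z-score above a hypothetical `threshold` of 2.5 are flagged. A threshold of 3 is another common convention, but with only 12 values the score of 220 has a z-score of about 2.65 and would not be flagged by it.

```python
from statistics import mean, pstdev

# Runs scored by the cricket team in the 12 matches
runs = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]


def z_scores(values):
    """Return the z-score of every value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(x - mu) / sigma for x in values]


threshold = 2.5  # hypothetical cut-off, chosen for this small sample
scores = z_scores(runs)

# Keep the values whose z-score magnitude exceeds the threshold
flagged = [x for x, z in zip(runs, scores) if abs(z) > threshold]
print(flagged)  # [220]
```

The function assumes the values are not all identical, since a standard deviation of zero would make the division undefined.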

There are also machine learning anomaly detection models, including Isolation Forest and One-Class SVM, that can be used to identify outliers.

An anomaly is a deviation from the expected or normal behavior or pattern. Anomaly detection is the process of identifying these unusual patterns or behaviors in data. It is important because anomalies can indicate significant events or problems, such as fraudulent activity, equipment failure, or security breaches.

After anomalies are identified, it is important to evaluate and validate the results to ensure that they are accurate and meaningful. This may involve comparing the results to known anomalies, or using domain knowledge to interpret the findings. 

There are many different algorithms and approaches to anomaly detection, and choosing the right one can be a challenge. Here, we focus on anomaly detection algorithms in the scikit-learn library.

One of the most widely used algorithms for anomaly detection is the Local Outlier Factor (LOF) algorithm. It uses the local density of points in a dataset to identify anomalies. The basic idea is that, in most datasets, the density of points is relatively uniform, with only a few points having significantly lower or higher densities than the rest. LOF uses this property to identify points whose density is significantly lower than that of their neighbors; such points are likely to be anomalies.

To use the LOF algorithm, we first need to choose a metric to measure the density of points. The most common choice is the distance to a point's k nearest neighbors (the KNN distance), where k is set through the `n_neighbors` parameter.


In [None]:
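import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import LocalOutlierFactor

# Use two features of the iris dataset so the result can be plotted in 2D
df = load_iris(as_frame=True).frame
X = df[['sepal length (cm)', 'sepal width (cm)']]

# Fit LOF; each point's density is compared with that of its 5 nearest neighbors
lof = LocalOutlierFactor(n_neighbors=5)
lof.fit(X)

# negative_outlier_factor_ is the opposite of the LOF score:
# the lower (more negative) the value, the more anomalous the point
scores = lof.negative_outlier_factor_

# Flag the 5% of points with the lowest scores as outliers
is_outlier = scores < np.percentile(scores, 5)
inliers, outliers = X[~is_outlier], X[is_outlier]

# Plot inliers in green and outliers in red
plt.scatter(inliers.iloc[:, 0], inliers.iloc[:, 1], color='green', label='inlier')
plt.scatter(outliers.iloc[:, 0], outliers.iloc[:, 1], color='red', label='outlier')

plt.xlabel('sepal length (cm)', fontsize=13)
plt.ylabel('sepal width (cm)', fontsize=13)
plt.title('Local Outlier Factor', fontsize=15)
plt.legend()
plt.show()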



Another popular algorithm for anomaly detection is the Isolation Forest algorithm. It builds an ensemble of randomly constructed decision trees, each of which repeatedly splits the data on a randomly chosen feature and split value until the points are isolated. The basic idea is that anomalies are few and different from the rest of the data, so they tend to be isolated after only a few splits, close to the root of a tree. Normal points sit in dense regions and typically require many more splits to be separated from their neighbors. The Isolation Forest algorithm uses this property by scoring each point by its average path length across the trees: the shorter the path, the more likely the point is an anomaly.
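
A minimal sketch of how this might look with scikit-learn's `IsolationForest` follows. The synthetic dataset, the `contamination` value of 0.05, and the fixed random seeds are assumptions made purely for illustration; in practice these would depend on the data at hand. In scikit-learn, `fit_predict` returns -1 for points flagged as outliers and 1 for inliers, and lower `score_samples` values indicate more anomalous points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration): 100 points clustered
# around the origin plus one point placed far away from the cluster
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X = np.vstack([cluster, [[8.0, 8.0]]])

# contamination is the expected share of outliers (hypothetical value)
forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)

labels = forest.fit_predict(X)    # 1 = inlier, -1 = outlier
scores = forest.score_samples(X)  # lower = more anomalous

print("Flagged indices:", np.where(labels == -1)[0])
print("Most anomalous point:", X[np.argmin(scores)])
```

Because `contamination` fixes the share of points that get flagged, a few ordinary points on the edge of the cluster are labeled as outliers alongside the distant one; inspecting the raw scores helps separate the clear anomalies from the borderline cases.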