In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### What is an outlier?

An outlier is an observation that is numerically distant from the rest of the data. In layman's terms, it is a value that falls outside the usual range. Let us take an example and compare a data set without an outlier to the same data set with one.
    
| | Data without outlier | Data with outlier |
|---|---|---|
| Data | 1, 2, 3, 3, 4, 5, 4 | 1, 2, 3, 3, 4, 5, 400 |
| Mean | 3.143 | 59.714 |
| Median | 3 | 3 |
| Standard deviation | 1.345185 | 150.057 |

Here you can see that the data set with the outlier has a significantly different mean and standard deviation. Without the outlier the mean is 3.14; with the outlier it jumps to 59.71, which would change the estimate completely. The median, by contrast, stays at 3 in both cases.
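
The figures in this comparison can be reproduced with a few lines of NumPy. This is only a minimal sketch: the helper name `summarize` is made up for illustration, and it assumes the standard deviation shown is the sample standard deviation (`ddof=1`).

```python
import numpy as np

without_outlier = np.array([1, 2, 3, 3, 4, 5, 4])
with_outlier = np.array([1, 2, 3, 3, 4, 5, 400])

def summarize(data):
    """Return mean, median and sample standard deviation (ddof=1) of data."""
    return {
        "mean": float(np.mean(data)),
        "median": float(np.median(data)),
        "sd": float(np.std(data, ddof=1)),  # assumption: sample SD
    }

for name, data in [("without outlier", without_outlier),
                   ("with outlier", with_outlier)]:
    s = summarize(data)
    print(f"{name}: mean={s['mean']:.3f}, median={s['median']:.1f}, sd={s['sd']:.3f}")
```

A single extreme value moves the mean and the standard deviation a long way, while the median does not move at all.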
 
Let us take a real-world example. In a company of 50 employees, 45 people have a monthly salary of Rs. 6000 and 5 senior employees have a monthly salary of Rs. 1000000 each. The average monthly salary works out to (45 × 6000 + 5 × 1000000) / 50 = Rs. 105,400, which leads to the wrong conclusion, because 45 of the 50 employees earn far less than that. The median salary is Rs. 6000, which describes a typical employee much better than the average does. For this reason the median is a more appropriate measure than the mean when outliers are present.
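
The salary example can be checked in the same way. The sketch below simply takes the figures as stated (45 salaries of Rs. 6000 and 5 of Rs. 1000000); the variable names are illustrative.

```python
import numpy as np

# 45 employees at Rs. 6000 and 5 senior employees at Rs. 1000000, as stated above
salaries = np.array([6000] * 45 + [1_000_000] * 5)

mean_salary = float(np.mean(salaries))
median_salary = float(np.median(salaries))

print("Mean salary:  ", mean_salary)
print("Median salary:", median_salary)
```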

"Outlier" is a term commonly used by analysts and data scientists, because outliers need close attention; otherwise they can result in wildly wrong estimates. Simply speaking, an outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample.

### Causes For Outliers

1. Data Entry Errors
2. Measurement Errors
3. Natural Outliers


### Outlier Detection

Outliers can be of two types:

a. Univariate

The example discussed above is a univariate outlier. These outliers can be found by looking at the distribution of a single variable.

b. Multivariate

Multivariate outliers are outliers in an n-dimensional space. To find them, you have to look at the joint distribution of several variables.


### Different outlier detection techniques

1. Hypothesis Testing
2. Z-score method
3. Robust Z-score
4. IQR method
5. Winsorization method (Percentile Capping)
6. DBSCAN Clustering
7. Isolation Forest
8. Visualizing the data


### 1. Hypothesis Testing (Grubbs' Test)

 
Grubbs' test is defined for the hypotheses:

$$
\begin{array}{ll}
H_{0}: & \text{There are no outliers in the data set} \\
H_{1}: & \text{There is exactly one outlier in the data set}
\end{array}
$$

The Grubbs' test statistic is defined as:

$$
G_{calculated}=\frac{\max \left|X_{i}-\overline{X}\right|}{SD}
$$

with $\overline{X}$ and $SD$ denoting the sample mean and standard deviation, respectively. For a sample of size $N$ and significance level $\alpha$, the critical value is:

$$
G_{critical}=\frac{(N-1)}{\sqrt{N}} \sqrt{\frac{\left(t_{\alpha /(2 N), N-2}\right)^{2}}{N-2+\left(t_{\alpha /(2 N), N-2}\right)^{2}}}
$$

where $t_{\alpha /(2 N), N-2}$ is the critical value of the t-distribution with $N-2$ degrees of freedom at a significance level of $\alpha /(2 N)$.

If the calculated value is greater than the critical value, you can reject the null hypothesis and conclude that one of the values is an outlier.

In [4]:
import numpy as np
import scipy.stats as stats

x = np.array([12, 13, 14, 19, 21, 23])
y = np.array([12, 13, 14, 19, 21, 23, 45])

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier at significance level alpha."""
    n = len(x)
    mean_x = np.mean(x)
    # Note: np.std defaults to the population SD (ddof=0). Grubbs' statistic is
    # usually defined with the sample SD, np.std(x, ddof=1), which gives a
    # somewhat smaller G.
    sd_x = np.std(x)
    # Largest absolute deviation from the mean
    numerator = np.max(np.abs(x - mean_x))
    g_calculated = numerator / sd_x
    print("Grubbs Calculated Value: ", g_calculated)
    # Critical t value with n - 2 degrees of freedom at level alpha / (2n)
    t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = ((n - 1) * np.sqrt(np.square(t_value))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value)))
    print("Grubbs Critical Value:", g_critical)
    if g_critical > g_calculated:
        print("From grubbs_test we observe that the calculated value is less than the critical value. Fail to reject the null hypothesis: no outlier is detected.\n")
    else:
        print("From grubbs_test we observe that the calculated value is greater than the critical value. Reject the null hypothesis and conclude that there is an outlier.\n")

In [5]:
grubbs_test(x)

Grubbs Calculated Value:  1.4274928542926593
Grubbs Critical Value: 1.887145117792422
From grubbs_test we observe that the calculated value is less than the critical value. Fail to reject the null hypothesis: no outlier is detected.



In [6]:
grubbs_test(y)

Grubbs Calculated Value:  2.2765147221587774
Grubbs Critical Value: 2.019968507680656
From grubbs_test we observe that the calculated value is greater than the critical value. Reject the null hypothesis and conclude that there is an outlier.
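
The test statistic is defined in terms of the sample standard deviation, so a variant that uses `ddof=1` and returns its result instead of printing it may be easier to reuse. The following is a sketch under that assumption; the function name `grubbs_outlier` and its return format are illustrative and not part of the notebook above.

```python
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """Two-sided Grubbs' test using the sample SD (ddof=1).

    Returns (is_outlier, suspect_value, g_calculated, g_critical).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    deviations = np.abs(x - x.mean())
    idx = int(np.argmax(deviations))          # position of the most extreme value
    g_calculated = deviations[idx] / x.std(ddof=1)
    # Critical t value with n - 2 degrees of freedom at level alpha / (2n)
    t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = (n - 1) / np.sqrt(n) * np.sqrt(t_value**2 / (n - 2 + t_value**2))
    return bool(g_calculated > g_critical), float(x[idx]), float(g_calculated), float(g_critical)

print(grubbs_outlier([12, 13, 14, 19, 21, 23]))
print(grubbs_outlier([12, 13, 14, 19, 21, 23, 45]))
```

With the sample standard deviation the calculated statistic is slightly smaller than with the population one, but the verdict for both example arrays stays the same: no outlier in the first, and 45 flagged in the second.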



### 2. Z-Score Method

Using the Z-score method, we can find how many standard deviations a value lies away from the mean.

<img style="float: center;"  src="https://i.pinimg.com/originals/cd/14/73/cd1473c4c82980c6596ea9f535a7f41c.jpg" width="350px">

The figure above shows the area under the normal curve and how much of that area each range of standard deviations covers:
1. About 68% of the data points lie within ±1 SD of the mean.
2. About 95% of the data points lie within ±2 SD of the mean.
3. About 99.7% of the data points lie within ±3 SD of the mean.
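
These percentages follow from the cumulative distribution function of the standard normal distribution. As a small check, assuming normally distributed data, they can be recomputed with `scipy.stats.norm` (the helper name `coverage` is illustrative):

```python
from scipy.stats import norm

def coverage(k):
    """Probability that a standard normal value lies within k SDs of the mean."""
    return float(norm.cdf(k) - norm.cdf(-k))

for k in (1, 2, 3):
    print(f"within +/-{k} SD: {coverage(k):.4f}")
```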

Z-score formula 

    Zscore = (X - Mean) / Standard Deviation

If the absolute value of a data point's z-score is more than 3, the point lies outside the range that covers 99.7% of the area, which indicates that the value is quite different from the other values. Such points are treated as outliers.

In [9]:
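import pandas as pd
import numpy as np

train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

def Zscore(df, threshold=3):
    """Print the values whose absolute z-score exceeds threshold."""
    out = []            # local list, so repeated calls do not accumulate results
    mean = np.mean(df)  # computed once instead of on every iteration
    std = np.std(df)    # population standard deviation (ddof=0)
    for i in df:
        z = (i - mean) / std
        if np.abs(z) > threshold:
            out.append(i)
    print("Outliers: ", out)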


In [10]:
Zscore(train['LotArea'])

Outliers:  [50271, 159000, 215245, 164660, 53107, 70761, 53227, 46589, 115149, 53504, 45600, 63887, 57200]
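
The same check can also be written without an explicit loop. The sketch below is a vectorized variant; the function name `zscore_outliers` is illustrative, and the data are a made-up stand-in for a column such as `LotArea`, not the actual competition file.

```python
import numpy as np
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    """Return the entries of series whose absolute z-score exceeds threshold."""
    # Population SD (ddof=0), matching the np.std default
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() > threshold]

# Made-up data: 100 evenly spaced values plus one extreme value
values = pd.Series(np.append(np.linspace(8_000, 12_000, 100), 150_000))
print(zscore_outliers(values))
```

Returning a Series keeps the original index, so the flagged rows can be looked up in the source DataFrame afterwards.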


In [11]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
