# Finding Outliers

## Summary
This document shows how to find outliers in the wss_manufacturer_classify dataset.

## Questions to be answered
- Why should we find the outliers in the dataset? How do outliers in na_value_ratio affect the classification?
- What are the confidence level and the confidence interval for the mean of na_value_ratio?

### Answer to the 1st question
- Outliers may degrade the classification performance of the algorithms. The initial data analysis showed that several data points were wrongly classified. The reason is that a high na_value_ratio lowers the mean value of the speed, which leads to a wrong classification. The task is to find these points.

### Imports
Imports should be grouped in the following order:
1. Magics

2. Alphabetical order
    
    A. standard library imports
    
    B. related 3rd party imports
    
    C. local application/library specific imports

In [1]:
# Standard library
import math
import os
import sys
# sys.path.append('../src/')

# Third party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Local imports

In [2]:
# Customizations
sns.set()  # apply seaborn's default theme to matplotlib figures

# Any tweaks that normally go in .matplotlibrc, etc., should be explicitly stated here
plt.rcParams['figure.figsize'] = (12,8)
%config InlineBackend.figure_format = 'retina'

In [4]:
# # Find  the notebook the saved figures came from
# fig_prefix = '../figures/2016-12-07-rm-'

### Load data

#### Import clean data

In [3]:
df = pd.read_pickle('../data/wss_manufacturer_classify') #this dataset is used to do the manufacturer classification

#### Problem statement
Statistical interval estimation is used to calculate an interval for the mean of na_value_ratio. We care most about the upper limit of na_value_ratio, so a one-sided confidence interval is used in this situation. The formula

$$(-\infty, \overline X + \frac{S}{\sqrt{n}}t_{\alpha}(n-1))$$

gives a confidence interval for the mean of na_value_ratio at a confidence level of $1-\alpha$. Here we choose $\alpha = 0.005$, so $t_{0.005}(60) = 2.660$.

Similarly,

$$(0, \frac{(n-1)S^{2}}{\chi_{1-\alpha }^{2}(n-1)})$$

is used to obtain a confidence interval for the variance (the square of the standard deviation) of na_value_ratio at a confidence level of $1-\alpha$. Here we choose $\alpha = 0.01$, with $\chi_{0.99}^{2}(60) = 88.379$. Taking the square root of the upper limit gives the corresponding limit for the standard deviation.

In [4]:
# There are 2 manufacturers in this dataset, so the confidence intervals are calculated separately.
df1 = df[df['manufacturer'] == 'Knorr Bremse']  # the data describing Knorr Bremse
test1 = df1.sample(n=61)  # draw 61 random samples from this subset
df2 = df[df['manufacturer'] == 'Haldex']  # the data describing Haldex
test2 = df2.sample(n=61)  # draw 61 random samples from this subset

In [5]:
test1.describe()

| | na_value_ratio | mean | max | std |
|---|---|---|---|---|
| count | 61.0 | 61.0 | 61.0 | 61.0 |
| mean | 0.006879 | 7.60859 | 72.0 | 11.128119 |
| std | 0.007957 | 3.306473 | 24.941933 | 4.42264 |
| min | 0.0 | 0.4 | 1.0 | 0.534522 |
| 25% | 0.0 | 5.5625 | 70.0 | 9.819748 |
| 50% | 0.004717 | 6.735294 | 82.0 | 10.967289 |
| 75% | 0.010228 | 9.942519 | 86.0 | 11.933306 |
| max | 0.027253 | 21.125 | 126.0 | 35.089427 |


In [6]:
test2.describe()

| | na_value_ratio | mean | max | std |
|---|---|---|---|---|
| count | 61.0 | 61.0 | 61.0 | 61.0 |
| mean | 0.054404 | 128.565361 | 248.852459 | 105.28211 |
| std | 0.076218 | 18.935021 | 5.638664 | 3.618832 |
| min | 0.0 | 68.666667 | 206.0 | 98.540989 |
| 25% | 0.021277 | 123.785714 | 250.0 | 102.487565 |
| 50% | 0.027893 | 133.414989 | 250.0 | 104.630707 |
| 75% | 0.042714 | 139.768987 | 250.0 | 107.630308 |
| max | 0.4 | 158.092643 | 250.0 | 117.972313 |


From the output above, it can be seen that:
- For the data from Knorr Bremse, $\overline X = 0.006879$ and $S = 0.007957$. The confidence interval for the mean of na_value_ratio is $(-\infty, 0.006879 + \frac{0.007957}{\sqrt{61}} \times 2.660)$; since a ratio cannot be negative, this is $(0, 0.009589)$. The confidence interval for the standard deviation of na_value_ratio is $(0, \sqrt{\frac{60 \times 0.007957^{2}}{88.379}})$, which is $(0, 0.006556)$.
- For the data from Haldex, $\overline X = 0.054404$ and $S = 0.076218$. The confidence interval for the mean of na_value_ratio is $(-\infty, 0.054404 + \frac{0.076218}{\sqrt{61}} \times 2.660)$; since a ratio cannot be negative, this is $(0, 0.080362)$. The confidence interval for the standard deviation of na_value_ratio is $(0, \sqrt{\frac{60 \times 0.076218^{2}}{88.379}})$, which is $(0, 0.062800)$.
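The arithmetic in the two bullets above can be reproduced with a few lines of plain Python. This is only a sketch: the helper names `mean_upper_limit` and `std_upper_limit` are hypothetical, and the two critical values are copied from this document rather than recomputed from a distribution table.

```python
import math

# Critical values copied from the text above, not recomputed here.
T_CRIT = 2.660      # value quoted for t with 60 degrees of freedom
CHI2_CRIT = 88.379  # value quoted for chi-square with 60 degrees of freedom


def mean_upper_limit(x_bar, s, n, t_crit=T_CRIT):
    """Upper end of the one-sided interval for the mean."""
    return x_bar + s / math.sqrt(n) * t_crit


def std_upper_limit(s, n, chi2_crit=CHI2_CRIT):
    """Square root of (n - 1) * S^2 / chi2_crit, as used in the text."""
    return math.sqrt((n - 1) * s ** 2 / chi2_crit)


# Sample mean and standard deviation of na_value_ratio, taken from the
# describe() outputs above (n = 61 for each manufacturer).
samples = {
    "Knorr Bremse": (0.006879, 0.007957),
    "Haldex": (0.054404, 0.076218),
}

limits = {
    name: (mean_upper_limit(x_bar, s, 61), std_upper_limit(s, 61))
    for name, (x_bar, s) in samples.items()
}

for name, (mean_hi, std_hi) in limits.items():
    print(f"{name}: mean limit {mean_hi:.6f}, std limit {std_hi:.6f}")
```

Because `df.sample` draws a different random sample on each run, the inputs in `samples` would need to be replaced with the statistics of whatever sample is actually drawn.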

In [7]:
df1[df1['na_value_ratio'] > (0.009589 + 3*0.006556)] # The outliers from the manufacturer Knorr Bremse

| | itapudid | na_value_ratio | manufacturer | mean | max | std |
|---|---|---|---|---|---|---|


In [8]:
df2[df2['na_value_ratio'] > (0.080362 + 3*0.062800)] # The outliers from the manufacturer Haldex

| | itapudid | na_value_ratio | manufacturer | mean | max | std |
|---|---|---|---|---|---|---|
| 3 | 163540018001DC915C935 | 0.319157 | Haldex | 142.673547 | 250.0 | 101.989739 |
| 50 | 164730014001DC924DA0D | 0.4 | Haldex | 68.666667 | 206.0 | 106.377943 |
| 127 | 164950018001DC92C87EE | 0.76087 | Haldex | 22.888889 | 206.0 | 68.666667 |
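The two filters above use the same cut-off: the upper limit of the mean plus three times the upper limit of the standard deviation. A minimal sketch of that rule follows, written without pandas so that it runs on its own. The function names `outlier_threshold` and `flag_outliers` are hypothetical, and the factor `k=3` is simply the multiplier that appears in the filter expressions.

```python
def outlier_threshold(mean_limit, std_limit, k=3):
    """Cut-off used in the filters above: mean limit plus k times the std limit."""
    return mean_limit + k * std_limit


def flag_outliers(values, threshold):
    """Return the values that are strictly greater than the threshold."""
    return [v for v in values if v > threshold]


# Limits taken from the confidence intervals computed earlier in this notebook.
knorr_cutoff = outlier_threshold(0.009589, 0.006556)
haldex_cutoff = outlier_threshold(0.080362, 0.062800)

# na_value_ratio of the three Haldex rows listed above, plus the Haldex
# sample median (0.027893) as an example of a value that is not flagged.
haldex_values = [0.319157, 0.4, 0.76087, 0.027893]
haldex_outliers = flag_outliers(haldex_values, haldex_cutoff)

print(f"Knorr Bremse cut-off: {knorr_cutoff:.6f}")
print(f"Haldex cut-off: {haldex_cutoff:.6f}")
print("Flagged Haldex values:", haldex_outliers)
```

With the DataFrames from this notebook, the same cut-off could be passed to the existing filter, for example `df2[df2['na_value_ratio'] > haldex_cutoff]`, instead of typing the numbers by hand.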
