# Outlier Analysis

An outlier is a data point that lies far from the other observations in the same dataset. Outliers may reflect natural variability in the measurement, or they may indicate experimental errors.
Many machine learning algorithms are sensitive to the range and distribution of attribute values. Outliers can mislead the training process, resulting in longer training times, less accurate models and ultimately poorer results.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns #visualization tools

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
ad=pd.read_csv("../input/advertising/advertising.csv")
df = ad.copy()
df = df.select_dtypes(include = ['float64', 'int64'])  # keep only the numeric columns
df.head()

In [None]:
df_table = df["Area Income"].copy()

## Boxplot

In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers), indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.

The definition above suggests that if there is an outlier, it will be plotted as an individual point in the boxplot, while the rest of the data will be grouped together and displayed as the box and its whiskers. Let's try and see it ourselves :)
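
To make the idea concrete, here is a small sketch of what a boxplot draws. It assumes the common 1.5 × IQR whisker rule (as far as I know, the default in matplotlib and seaborn) and uses made-up toy values, not the advertising data.

```python
import numpy as np

# Toy data (hypothetical values, not taken from the advertising dataset)
values = np.array([10, 12, 13, 14, 15, 16, 18, 95], dtype=float)

# The box spans the first and third quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Assumed whisker rule: whiskers may reach at most 1.5 * IQR beyond the box
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
inside = values[(values >= low_fence) & (values <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual point
fliers = values[(values < low_fence) | (values > high_fence)]

print("box:", q1, q3)
print("whiskers:", whisker_low, whisker_high)
print("points drawn individually:", fliers)
```

In this toy example only the value 95 falls outside the fences, so it is the only point that would be drawn separately.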

In [None]:
sns.boxplot(x = df_table)

## IQR 

The IQR describes the middle 50% of values. To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
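
As a small worked example, the sketch below computes the quartiles of a toy series in two ways: with the "median of each half" recipe described above, and with `Series.quantile`, which interpolates linearly by default. The toy values and variable names are assumptions for illustration only. The two methods can give slightly different quartiles on small samples.

```python
import pandas as pd

# Toy data (hypothetical), already easy to sort by eye
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)

# Recipe from the text: median of the lower half and of the upper half.
# For an odd number of values this sketch leaves the middle value out of
# both halves, which is one common convention among several.
sorted_s = s.sort_values().reset_index(drop=True)
half = len(sorted_s) // 2
q1_halves = sorted_s.iloc[:half].median()
q3_halves = sorted_s.iloc[len(sorted_s) - half:].median()
iqr_halves = q3_halves - q1_halves

# pandas default: linear interpolation between neighbouring values
q1_pd = s.quantile(0.25)
q3_pd = s.quantile(0.75)
iqr_pd = q3_pd - q1_pd

print("median-of-halves:", q1_halves, q3_halves, iqr_halves)
print("pandas quantile: ", q1_pd, q3_pd, iqr_pd)
```

The cells below use `quantile`, so the bounds follow the interpolated quartiles.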

In [None]:
Q1 = df_table.quantile(0.25)
Q3 = df_table.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1- 1.5*IQR
upper_bound = Q3 + 1.5*IQR
print("lower bound is " + str(lower_bound))
print("upper bound is " + str(upper_bound))
print("Q1 is " + str(Q1))
print("Q3 is " + str(Q3))

In [None]:
(df_table < (lower_bound)) | (df_table > (upper_bound))

In [None]:
outliers_vector = (df_table < (lower_bound)) | (df_table > (upper_bound) )
outliers_vector

In [None]:
outliers = df_table[outliers_vector]

In [None]:
outliers.index  # index labels of the outlier rows

In [None]:
df_table.shape

### Deleting Outliers

In [None]:
# Keep only the values that lie within the lower and upper bounds
clean_df_table = df_table[~((df_table < lower_bound) | (df_table > upper_bound))]

In [None]:
clean_df_table.shape

### Filling with the Mean

In [None]:
df_table = df["Area Income"].copy()

In [None]:
sns.boxplot(x= df_table)

In [None]:
df_table.mean()

In [None]:
df_table.describe()

In [None]:
df_table[outliers_vector] = df_table.mean()  # replace every outlier with the column mean

In [None]:
df_table[outliers_vector].head()

In [None]:
df_table.describe()
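
One thing to keep in mind: a mean computed on the whole column is itself pulled towards the outliers. The sketch below, on a made-up toy series, compares that with a mean computed only from the values inside the bounds. This is a possible variation under that assumption, not the method used in the cells above.

```python
import pandas as pd

# Toy data (hypothetical) with one obvious outlier
s = pd.Series([50, 52, 54, 56, 58, 300], dtype=float)

# Same IQR rule as in the notebook
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

overall_mean = s.mean()               # influenced by the outlier
inlier_mean = s[~is_outlier].mean()   # computed without the outlier

# Replace the flagged values with the inlier mean
filled = s.mask(is_outlier, inlier_mean)

print("overall mean:", overall_mean)
print("inlier mean: ", inlier_mean)
print(filled.tolist())
```

Which mean is the better fill value depends on the data; the median is another option that is less affected by extreme values.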

### Filling with the Suppression Method

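If we are not sure that the suspicious observations are really outliers, or if we still want to keep them in the data, suppression may be a good method: values below the lower bound are replaced with the lower bound, and values above the upper bound are replaced with the upper bound.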
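
As a compact illustration of suppression, the sketch below caps a made-up toy series at its IQR bounds with `Series.clip`. The toy values and variable names are assumptions; the cells that follow apply the same idea to the real column step by step.

```python
import pandas as pd

# Toy data (hypothetical) with one very high and one very low value
s = pd.Series([50, 52, 54, 56, 58, 300, 1], dtype=float)

# Bounds from the 1.5 * IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# clip() replaces values below `lower` with `lower` and values above `upper` with `upper`
capped = s.clip(lower=lower, upper=upper)

print("bounds:", lower, upper)
print(capped.tolist())
```

No rows are dropped here, so the series keeps its original length.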

In [None]:
df_table = df["Area Income"].copy()
print(df_table.min())
print(df_table.max())

In [None]:
# Cap each value at the nearest bound (explicit loop version)
for e in range(len(df_table)):
    if df_table.iloc[e] < lower_bound:
        df_table.iloc[e] = lower_bound
    elif df_table.iloc[e] > upper_bound:
        df_table.iloc[e] = upper_bound

In [None]:
df_table.min()

In [None]:
df_table.max()

### Or, with boolean masks

In [None]:
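df_table = df["Area Income"].copy()  # start again from the raw column, since the loop above already capped df_table
outliers_lower_vector = (df_table < (lower_bound))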

In [None]:
outliers_upper_vector = (df_table > (upper_bound))

In [None]:
df_table[outliers_lower_vector] = lower_bound
df_table[outliers_upper_vector] = upper_bound

In [None]:
df_table.max()

In [None]:
df_table.min()