### Outlier?
- In statistics, an outlier is an observation point that is distant from other observations ([wiki](https://en.wikipedia.org/wiki/Outlier)).
- Put differently, it is a data point that lies outside the overall distribution of the dataset.


### Criteria to identify an outlier?

- A data point whose Z-score is greater than 3 or less than -3 is identified as an outlier.
- A data point that falls more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile is identified as an outlier.

### Reasons for an outlier to exist in a dataset?

- A mistake during data collection
- Natural variance in the data
- An experimental measurement error

### Impacts of having outliers in a dataset?

- Outliers can distort statistical analysis, because many summary statistics are sensitive to extreme values.
- In particular, they may have a significant impact on the mean and the standard deviation.
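
To make the effect on the mean and standard deviation concrete, here is a minimal sketch. The values in `clean` are hypothetical and chosen only for illustration; they are not the dataset used later in this notebook. Adding a single extreme value shifts the mean and inflates the standard deviation, while the median is unchanged.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
clean = np.array([10, 12, 11, 13, 12, 14])
with_outlier = np.append(clean, 100)  # add one extreme value

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{name:>12}: mean={np.mean(values):.2f}, "
        f"std={np.std(values):.2f}, median={np.median(values):.2f}"
    )
```

The median is included only as a point of comparison, since it depends on rank order rather than on the magnitude of the extreme value.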

### Various ways of finding outliers
1. Discover outliers with visualization tools
  - Scatter plots
  - Box plots
2. Discover outliers with mathematical functions
  - Z-score
  - IQR (interquartile range)



In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
dataset= [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107, 10,13,12,14,12,108,12,11,14,13,15,10,15,12,10,14,13,15,10]

## Discover outliers with mathematical function

### Z score

[**Wiki Definition:**](https://en.wikipedia.org/wiki/Standard_score)

The **Z-score** is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.

Z score = (Observation - Mean) / Standard Deviation

z = (X - μ) / σ

In [3]:
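def detect_outlier(data):
    """Return the values in `data` whose absolute z-score exceeds 3."""
    outlier = []  # local list, so repeated calls do not accumulate results
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outlier.append(i)
    return outlier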

In [4]:
outlier_data=detect_outlier(dataset)

In [5]:
outlier_data

[102, 107, 108]
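
The same check can also be written without an explicit loop. The following is a sketch of a vectorised variant; the function name `detect_outliers_zscore` and its `threshold` parameter are illustrative names, not part of the original notebook. It uses the population standard deviation (NumPy's default, `ddof=0`) and does not guard against a standard deviation of zero.

```python
import numpy as np

def detect_outliers_zscore(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    values = np.asarray(data, dtype=float)
    # np.ndarray.std defaults to the population standard deviation (ddof=0)
    z_scores = (values - values.mean()) / values.std()
    # Boolean mask keeps only the values beyond the threshold
    return values[np.abs(z_scores) > threshold]

# Same values as the notebook's `dataset`
sample = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
          10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
          14, 13, 15, 10]

zscore_outliers = detect_outliers_zscore(sample)
print(zscore_outliers)
```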

### IQR (Interquartile Range)


[**Wikipedia Definition:**](https://en.wikipedia.org/wiki/Interquartile_range)

The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1.

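In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot of the data.

Because the quartiles are determined by the middle of the sorted data rather than by its extremes, the IQR is much more robust against outliers than the mean and standard deviation used by the Z-score.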


### Steps
1. Arrange the data in increasing order
2. Calculate first quartile and third quartile
3. Find interquartile range
4. Find the lower bound: Q1 - 1.5 × IQR
5. Find the upper bound: Q3 + 1.5 × IQR

Anything that lies below the lower bound or above the upper bound is an outlier.
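
The steps above can be sketched with NumPy alone. This is an illustrative version rather than the notebook's own cells: the variable names are made up for the example, and `np.percentile` is used with its default linear interpolation, which handles the sorting internally.

```python
import numpy as np

# Same values as the notebook's `dataset`
sample = np.array([11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19,
                   107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15,
                   12, 10, 14, 13, 15, 10])

# Steps 1-2: first and third quartiles (np.percentile sorts internally)
q1, q3 = np.percentile(sample, [25, 75])

# Step 3: interquartile range
iqr = q3 - q1

# Steps 4-5: lower and upper bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside the bounds is flagged as an outlier
iqr_outliers = sample[(sample < lower_bound) | (sample > upper_bound)]

print("Q1, Q3, IQR:", q1, q3, iqr)
print("bounds:", lower_bound, upper_bound)
print("outliers:", iqr_outliers)
```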


In [6]:
## Sort the data
dataset = sorted(dataset)

In [7]:
import pandas as pd  # pandas must be imported before pd.DataFrame is used
dataset = pd.DataFrame(dataset)

In [None]:
quantile1 = dataset.quantile(0.25)
quantile3 = dataset.quantile(0.75)
print(quantile1.values,quantile3.values)

In [None]:
## Find the IQR
iqr_value=quantile3-quantile1
print(iqr_value)

In [None]:
## Find the lower bound value and the upper bound value

lower_bound_val = quantile1 - (1.5 * iqr_value)
upper_bound_val = quantile3 + (1.5 * iqr_value)

In [None]:
print(lower_bound_val.values,upper_bound_val.values)

In [None]:
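import pandas as pd
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run.
from sklearn.datasets import load_boston
boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names
# Create the DataFrame with the feature names as column labels
boston_df = pd.DataFrame(boston.data, columns=columns)
boston_df.head()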

## Discover outliers with visualization tools

### Box plot

[**Wiki Definition:**](https://en.wikipedia.org/wiki/Box_plot)

In descriptive statistics, a **box plot** is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. **Outliers** may be plotted as individual points.

In [None]:
import seaborn as sns
sns.boxplot(data=dataset);

The plot above shows three isolated points. These are outliers: they are not included in the box that covers the other observations and lie nowhere near the quartiles.
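
The points a box plot draws individually are those beyond the whiskers, which by default reach to the most extreme data points within 1.5 × IQR of the box. As a hedged sketch, Matplotlib's `matplotlib.cbook.boxplot_stats` helper returns those numbers without drawing anything. It is assumed here that seaborn's default (`whis=1.5`) produces the same fliers, since seaborn draws on top of Matplotlib.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Same values as the notebook's `dataset`
sample = np.array([11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19,
                   107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15,
                   12, 10, 14, 13, 15, 10])

# boxplot_stats returns one dict of statistics per input series
stats = boxplot_stats(sample, whis=1.5)[0]

print("quartiles:", stats["q1"], stats["q3"])
print("whiskers:", stats["whislo"], stats["whishi"])
print("fliers:", stats["fliers"])  # points drawn individually on the plot
```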

In [None]:
sns.boxplot(x=boston_df['DIS']);

The plot above shows three points between 10 and 12. These are outliers, as they are not included in the box that covers the other observations.

### Scatter plot

[**Wiki Definition:**](https://en.wikipedia.org/wiki/Scatter_plot)

A **scatter plot** is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(boston_df['INDUS'], boston_df['TAX'])
ax.set_xlabel('Proportion of non-retail business acres per town')
ax.set_ylabel('Full-value property-tax rate per $10,000')
plt.show()

Looking at the plot above, most of the data points lie in the bottom-left region, but some points sit far from the rest of the population, such as those in the top-right corner.