<a href="https://colab.research.google.com/github/MathMachado/DSWP/blob/master/Notebooks/3DP_3_Outliers%20Handling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 3DP - Outliers Handling

> __In statistics, an outlier is an observation point that is distant from other observations.__ - Wikipedia

# Machine Learning with Python (Scikit-Learn)

![Scikit-Learn](https://github.com/MathMachado/Python_RFB/blob/master/Material/scikit-learn-1.png?raw=true)

# What is Anomaly Detection?
> An anomaly is any point/observation that is unusual when compared with all the other points/observations. Anomaly detection is the task of finding such points.


# Main Algorithms and Strategies for Anomaly Detection

* Density-based Approaches:
    * RKDE - Robust Kernel Density Estimation (Kim & Scott, 2008);
    * EGMM - Ensemble Gaussian Mixture Model;
* Quantile-based Methods:
    * OCSVM - One-class SVM (Schölkopf et al., 1999);
    * SVDD - Support Vector Data Description (Tax & Duin, 2004);
* Neighbor-based Methods:
    * LOF: Local Outlier Factor (Breunig et al., 2000);
    * ABOD: K-NN Angle-based Outlier Detection (Kriegel et al., 2008);
* Projection-based Methods:
    * IFOR: Isolation Forest (Liu et al., 2008);
    * LODA: Lightweight Online Detection of Anomalies (Pevný, 2016)

# Where to find more information about Anomaly Detection?
[anomaly-detection-resources](https://github.com/MathMachado/anomaly-detection-resources)

# Load the (generic) Python libraries

In [0]:
!pip install bamboolib

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
import bamboolib
import matplotlib.pyplot as plt

# remove warnings to keep notebook clean
import warnings
warnings.filterwarnings('ignore')

# Load Data

## Titanic

In [0]:
import seaborn as sns
import pandas as pd

df_Titanic = sns.load_dataset('titanic')
# Keep only the two numeric features used in this notebook and drop missing values
df_Titanic = df_Titanic[['age', 'fare']].dropna()
df_Titanic.head()

In [0]:
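# Import the library used to transform (standardize) the data
from sklearn.preprocessing import StandardScaler

# Despite the df_ prefix, the result is a NumPy array with columns [age, fare]
df_Age_Fare = StandardScaler().fit_transform(df_Titanic[['age', 'fare']])
df_Age_Fare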

# Boxplot

![BoxPlot](https://github.com/MathMachado/Python_RFB/blob/master/Material/boxplot.png?raw=true)

In [0]:
def PlotaBoxPlot_Survived(df, column):
    plt.rcdefaults()
    # make boxplot with Catplot
    sns.catplot(x='survived', y= column, kind="box", data=df, height=4,aspect=1.5)
    
    # add data points to boxplot with stripplot
    sns.stripplot(x='survived', y= column, data=df, alpha=0.3,jitter=0.2,color='k');
    plt.show()

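In [0]:
# df_Titanic keeps only 'age' and 'fare', so reload the columns needed for this plot
df_Plot = sns.load_dataset('titanic')[['survived', 'age', 'fare']].dropna()
PlotaBoxPlot_Survived(df_Plot, 'fare')

In [0]:
df_Plot = sns.load_dataset('titanic')[['survived', 'age', 'fare']].dropna()
PlotaBoxPlot_Survived(df_Plot, 'age')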

# Z-Score

* The Z-Score can be used to detect outliers.
* It is the difference between the value and the sample mean, expressed as a number of standard deviations.
* If the z-score is below -2.5 or above 2.5, the value lies in roughly the most extreme 1.2% of a normal distribution (about 0.6% in each tail). However, it is common practice to use 3 instead of 2.5.

![Z_Score](https://github.com/MathMachado/Python_RFB/blob/master/Material/Z_Score.png?raw=true)
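
The share of a normal distribution that falls beyond a given z-score threshold can be checked numerically. The snippet below is a minimal sketch that assumes the data is normally distributed; the helper name `two_sided_tail` is hypothetical and is not part of this notebook.

```python
from scipy.stats import norm

def two_sided_tail(z):
    """Fraction of a standard normal distribution with |Z| > z."""
    # norm.sf is the survival function, 1 - CDF
    return 2 * norm.sf(z)

for threshold in (1.96, 2.5, 3.0):
    print(f"|z| > {threshold}: {two_sided_tail(threshold):.4%}")
```

Real features such as `fare` are often far from normal, so these percentages are only a reference point when choosing the threshold.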

Below, we define the function to detect outliers based on the Z-Score:

In [0]:
from scipy.stats import zscore

def ZScore_Outlier_Detect(df, column, threshold=2.5):
    # Z-score of each value (assumes the column has no missing values)
    z = zscore(df[column])
    # Flag values at least `threshold` standard deviations away from the mean
    df[column+'__is_outlier_ZS'] = abs(z) >= threshold

    # Smallest and largest values among the non-outliers
    df_AUX = df[~df[column+'__is_outlier_ZS']]
    min_vlr = df_AUX[column].min()
    max_vlr = df_AUX[column].max()

    # Cap the outliers at the nearest non-outlier value
    df[column+'_Outlier_ZS'] = df[column].clip(lower=min_vlr, upper=max_vlr)

Evaluating outliers using the Z-Score criterion:

In [0]:
lFeatures = ['age', 'fare']
for feature in lFeatures:
    ZScore_Outlier_Detect(df_Titanic, feature)

In [0]:
df_Titanic.head(100)
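
To make the capping step of `ZScore_Outlier_Detect` easier to follow, the sketch below applies the same idea to a small toy series. The data is hypothetical (19 ordinary values and one extreme value), and `Series.clip` is used here as one possible way to express the cap.

```python
import pandas as pd
from scipy.stats import zscore

# Toy data (hypothetical): 19 ordinary values and one extreme value
s = pd.Series([10.0] * 10 + [12.0] * 9 + [100.0])

# Flag values whose z-score is at least 2.5 in absolute value
z = pd.Series(zscore(s.to_numpy()), index=s.index)
is_outlier = z.abs() >= 2.5

# Cap flagged values at the min/max of the non-flagged values
lower, upper = s[~is_outlier].min(), s[~is_outlier].max()
capped = s.clip(lower=lower, upper=upper)

print(f"Flagged values: {s[is_outlier].tolist()}")
print(f"Range after capping: [{capped.min()}, {capped.max()}]")
```

Only the value 100 is flagged, and after capping it is replaced by the largest non-flagged value.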

# IQR Score

* The interquartile range (IQR) is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles: IQR = Q3 - Q1.

![BoxPlot](https://github.com/MathMachado/Python_RFB/blob/master/Material/boxplot.png?raw=true)
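
As a small worked example of the IQR rule, the sketch below computes the quartiles and the usual 1.5 × IQR fences on a toy array. The values are hypothetical and chosen so that a single point falls outside the fences.

```python
import numpy as np

# Toy data (hypothetical): nine ordinary values and one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers}")
```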

In [0]:
def IQR_Score_Outlier_Detect(df, column):
    # Quartiles and interquartile range
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    # Fences: 1.5 * IQR below Q1 and above Q3
    Lim_Inf = Q1 - 1.5*IQR
    Lim_Sup = Q3 + 1.5*IQR
    df[column+'__is_outlier_IQR'] = (df[column] <= Lim_Inf) | (df[column] >= Lim_Sup)

    # Smallest and largest values among the non-outliers
    df_AUX = df[~df[column+'__is_outlier_IQR']]
    min_vlr = df_AUX[column].min()
    max_vlr = df_AUX[column].max()

    # Cap the outliers at the nearest non-outlier value
    df[column+'_Outlier_IQR'] = df[column].clip(lower=min_vlr, upper=max_vlr)

Below, we apply the function to detect outliers based on the IQR Score:

In [0]:
for Features in lFeatures:
    IQR_Score_Outlier_Detect(df_Titanic, Features)

In [0]:
 #IQR_Score_Outlier_Detect(df_Titanic, 'fare')

In [0]:
df_Titanic.head(100)

# Bivariate Methods

![MetodosBivariados](https://github.com/MathMachado/Python_RFB/blob/master/Material/Clusters.png?raw=true)

[Source](https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html)

The goal of these methods is to build clusters of data, considering the attributes two at a time. The idea is that points that are not grouped into any cluster (such as the black points in the figure above) can be considered outliers.

# Comparison of Results

![CompararAnomalyDetection](https://github.com/MathMachado/Python_RFB/blob/master/Material/ComparingAnomalyDetection.png?raw=true)

[Source](https://scikit-learn.org/0.20/auto_examples/plot_anomaly_comparison.html)

# DBSCAN: Density-Based Spatial Clustering of Applications with Noise
* DBSCAN is a density-based clustering method;
* Points that do not belong to any cluster receive the label -1 (noise) and can be treated as outliers.

In [0]:
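from sklearn.preprocessing import StandardScaler
# Scale only the original features: df_Titanic now also holds the outlier columns created above
df_Age_Fare = StandardScaler().fit_transform(df_Titanic[['age', 'fare']])
df_Age_Fare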

In [0]:
from sklearn.cluster import DBSCAN
Outlier_Detection = DBSCAN(eps=0.5, metric='euclidean', min_samples=5)
clusters = Outlier_Detection.fit_predict(df_Age_Fare)
lFeature= ['age', 'fare']
print(f'Clusters: {set(clusters)}')

In [0]:
# Plot the clusters
plt.scatter(df_Age_Fare[:, 0], df_Age_Fare[:, 1], c=clusters, cmap="plasma")
plt.xlabel('Age')
plt.ylabel('Fare')

Therefore, use this technique to identify multivariate outliers (considering the attributes two at a time).
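
With DBSCAN, the outliers are the points labeled -1. The sketch below shows how to pull them out, using hypothetical toy data (one dense blob plus two distant points) rather than the Titanic features.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy data (hypothetical): one dense blob plus two far-away points
blob = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
far_points = np.array([[5.0, 5.0], [-6.0, 4.0]])
X = np.vstack([blob, far_points])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks points that belong to no cluster with the label -1
outlier_mask = labels == -1
print(f"Noise points found: {outlier_mask.sum()}")
print(X[outlier_mask])
```

The same mask can be applied to the notebook's own `clusters` array to select the rows flagged as noise.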

# OneClassSVM With Kernel RBF (Radial Basis Function)

In [0]:
from sklearn import svm

# One-class SVM with an RBF kernel; nu is an upper bound on the fraction of
# training points treated as outliers (0.05 = at most about 5%)
Outlier_detection = svm.OneClassSVM(kernel='rbf', gamma=0.001, nu=0.05)
clusters = Outlier_detection.fit_predict(df_Age_Fare)
print(f'Clusters: {set(clusters)}')

In [0]:
# plot the cluster assignments
plt.scatter(df_Age_Fare[:, 0], df_Age_Fare[:, 1], c=clusters, cmap="plasma")
plt.xlabel('Age')
plt.ylabel('Fare')
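
The `nu` parameter of `OneClassSVM` strongly affects how many points are flagged, since it bounds the fraction of training points that may fall outside the learned region. The sketch below illustrates this on hypothetical random data; `gamma="scale"` is an assumption made for the example, not the setting used in this notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))  # hypothetical standardized data

fractions = {}
for nu in (0.05, 0.5, 0.95):
    labels = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit_predict(X)
    # fit_predict returns -1 for outliers and 1 for inliers
    fractions[nu] = np.mean(labels == -1)
    print(f"nu={nu:.2f} -> fraction flagged as outliers: {fractions[nu]:.2f}")
```

A value of `nu` close to 1 therefore flags most of the data, which is rarely what is wanted for outlier detection.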

# Isolation Forests (*)
* Based on an ensemble of random trees (similar in spirit to Random Forest);
* Useful for detecting outliers in high-dimensional datasets;
* The algorithm randomly selects a feature and a split value, and keeps splitting recursively;
* Random partitioning produces shorter paths for anomalies;
* When a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
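
The link between short paths and anomalies can be seen through the scores that `IsolationForest` exposes. The sketch below is a minimal illustration on hypothetical data: a dense cloud plus one clearly isolated point.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))   # hypothetical dense cloud
anomaly = np.array([[8.0, 8.0]])      # one clearly isolated point
X = np.vstack([inliers, anomaly])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples: the lower the score, the shorter the average path length,
# that is, the more anomalous the sample
scores = forest.score_samples(X)
print(f"Mean score of the cloud: {scores[:-1].mean():.3f}")
print(f"Score of the isolated point: {scores[-1]:.3f}")
```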

In [0]:
from sklearn.ensemble import IsolationForest

# Label each observation as inlier (1) or outlier (-1); contamination is the expected share of outliers
Outlier_detection = IsolationForest(max_samples=100, random_state=20111974, contamination=.1)
clusters = Outlier_detection.fit_predict(df_Age_Fare)
print(f'Clusters: {set(clusters)}')

In [0]:
# plot the cluster assignments
plt.scatter(df_Age_Fare[:, 0], df_Age_Fare[:, 1], c=clusters, cmap="plasma")
plt.xlabel('Age')
plt.ylabel('Fare')

# Local Outlier Factor - LOF

* Based on nearest neighbors;
* Suited for moderately high-dimensional datasets;
* LOF computes a score reflecting the degree of abnormality of each observation;
* An abnormal observation is expected to have a lower local density than its neighbors;
* The LOF score measures how isolated an observation is with respect to its local neighborhood, not only with respect to the data as a whole.
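
The local nature of LOF can be illustrated with a point that sits just outside a very dense cluster. The data below is hypothetical, and the sketch reads the scores from the `negative_outlier_factor_` attribute of scikit-learn's `LocalOutlierFactor`.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
# Toy data (hypothetical): a dense cluster, a sparse cluster and one lonely point
dense = rng.normal(loc=0.0, scale=0.2, size=(100, 2))
sparse = rng.normal(loc=8.0, scale=1.0, size=(100, 2))
lonely = np.array([[1.5, 1.5]])
X = np.vstack([dense, sparse, lonely])

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)

# negative_outlier_factor_ stores -LOF; values of LOF near 1 mean "as dense as the neighbors"
lof_scores = -lof.negative_outlier_factor_
print(f"Median LOF: {np.median(lof_scores):.2f}")
print(f"LOF of the lonely point: {lof_scores[-1]:.2f}")
```

The lonely point is not far from the dense cluster in absolute terms, but its neighborhood is much sparser than that of its neighbors, so its LOF is high.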

In [0]:
from sklearn.neighbors import LocalOutlierFactor

Outlier_detection = LocalOutlierFactor(n_neighbors= 10, contamination=.1)
clusters = Outlier_detection.fit_predict(df_Age_Fare)
print(f'Clusters: {set(clusters)}')

In [0]:
# plot the cluster assignments
plt.scatter(df_Age_Fare[:, 0], df_Age_Fare[:, 1], c=clusters, cmap="plasma")
plt.xlabel('Age')
plt.ylabel('Fare')

# Elliptic Envelope
* The assumption here is that regular data comes from a known distribution (e.g., a Gaussian distribution);
* Inlier location and covariance are estimated robustly, so the fit is less affected by outliers;
* Mahalanobis distances computed from this robust covariance fit measure how outlying each observation is;
* Detects outliers through the "Robust Covariance" approach shown in the comparison figure earlier in this section.
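
The role of the Mahalanobis distance can be sketched with correlated toy data: a point that goes against the correlation gets a very large distance even if each coordinate looks unremarkable on its own. The data and the `contamination` value below are assumptions made for the example.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
# Toy data (hypothetical): strongly correlated Gaussian cloud
cov = [[1.0, 0.8], [0.8, 1.0]]
X = rng.multivariate_normal(mean=[0, 0], cov=cov, size=300)
# One extra point that goes against the correlation direction
X = np.vstack([X, [[3.0, -3.0]]])

envelope = EllipticEnvelope(contamination=0.05, random_state=0).fit(X)

# mahalanobis() returns squared Mahalanobis distances under the robust fit
d2 = envelope.mahalanobis(X)
labels = envelope.predict(X)

print(f"Median squared distance: {np.median(d2):.2f}")
print(f"Squared distance of the extra point: {d2[-1]:.2f}")
```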

In [0]:
from sklearn.covariance import EllipticEnvelope

Outlier_detection = EllipticEnvelope(contamination=.1,random_state= 201119740)
clusters = Outlier_detection.fit_predict(df_Age_Fare)
print(f'Clusters: {set(clusters)}')

In [0]:
# plot the cluster assignments
plt.scatter(df_Age_Fare[:, 0], df_Age_Fare[:, 1], c=clusters, cmap="plasma")
plt.xlabel('Age')
plt.ylabel('Fare')

# Using Gaussian Mixture Models - GMM
* Data might contain more than one peak in its distribution.
* Trying to fit such multimodal data with a unimodal model won't give a good fit.
* GMM makes it possible to fit such multimodal data.
* We will see how GMM can be used to find outliers.

In [0]:
from sklearn.mixture import GaussianMixture

In [0]:
plt.scatter(df_Titanic['age'], df_Titanic['fare'], s=10)

In [0]:
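# Fit the mixture only on the original features (df_Titanic now also holds the outlier columns)
gmm = GaussianMixture(n_components=3)
gmm.fit(df_Titanic[['age', 'fare']])
pred = gmm.predict(df_Titanic[['age', 'fare']])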

In [0]:
plt.scatter(df_Titanic['age'], df_Titanic['fare'],s=10,c=pred)
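
The cells above use the mixture to cluster the points. One possible way to turn a fitted GMM into an outlier detector is to flag the observations with the lowest log-density under the model. The sketch below uses hypothetical toy data, and the 2% cutoff is an arbitrary assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data (hypothetical): two Gaussian blobs plus one point far from both
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(150, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(150, 2)),
    [[2.5, -4.0]],
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples returns the log-density of each point under the fitted mixture
log_density = gmm.score_samples(X)

# Assumed cutoff: flag the 2% of points with the lowest log-density
threshold = np.percentile(log_density, 2)
is_outlier = log_density < threshold

print(f"Points flagged: {is_outlier.sum()}")
print(f"Lowest-density point: {X[log_density.argmin()]}")
```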

# PyOD Library - k-NN Outlier Detection Method

In [0]:
!pip install pyod

In [0]:
from pyod.models.knn import KNN

Outlier_detection = KNN(contamination= .021, n_neighbors= 5)
clusters = Outlier_detection.fit_predict(df_Age_Fare)
print(f'Clusters: {set(clusters)}')

In [0]:
# plot the cluster assignments
plt.scatter(df_Age_Fare[:, 0], df_Age_Fare[:, 1], c=clusters, cmap="plasma")
plt.xlabel('Age')
plt.ylabel('Fare')
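
A distance-based k-NN detector scores each point by how far away its k-th nearest neighbor is. The sketch below reproduces this idea with scikit-learn's `NearestNeighbors` on hypothetical data. It is an illustration of the principle, not PyOD's implementation, and the `contamination` value is an assumption.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Toy data (hypothetical): a Gaussian cloud plus one distant point
X = np.vstack([rng.normal(size=(200, 2)), [[7.0, 7.0]]])

k = 5
# Ask for k + 1 neighbors because each point is its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
knn_score = distances[:, -1]  # distance to the k-th neighbor

contamination = 0.02  # assumed share of outliers
threshold = np.quantile(knn_score, 1 - contamination)
is_outlier = knn_score > threshold

print(f"Points flagged: {is_outlier.sum()}")
print(f"Highest score at index: {knn_score.argmax()}")
```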

# Multivariate Detection: PyOD Library - PCA

In [0]:
from pyod.models.pca import PCA

Outlier_detection = PCA()
clusters = Outlier_detection.fit_predict(df_Age_Fare)
print(f'Clusters: {set(clusters)}')

In [0]:
# plot the cluster assignments
plt.scatter(df_Age_Fare[:, 0], df_Age_Fare[:, 1], c=clusters, cmap="plasma")
plt.xlabel('Age')
plt.ylabel('Fare')

# ABOD (Angle-Based Outlier Detection)
* Considers the relationship between each point and its neighbors;
* The PyOD library offers 2 versions of ABOD:
    * Fast ABOD: uses k-NN to approximate;
    * Original ABOD
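
The intuition behind ABOD is that, seen from an outlier, all other points lie in roughly the same direction, so the angles to pairs of other points vary little. The sketch below computes a simplified angle-based score with NumPy on hypothetical data. The function name `abof` is hypothetical, the score is a plain variance of distance-weighted cosines, and it considers all pairs of points, so it is only suitable for very small datasets.

```python
import numpy as np
from itertools import combinations

def abof(X, i):
    """Variance of distance-weighted cosines between point i and all pairs of other points."""
    p = X[i]
    others = np.delete(X, i, axis=0)
    values = []
    for a, b in combinations(others, 2):
        pa, pb = a - p, b - p
        # Cosine of the angle, additionally divided by the product of the distances
        values.append(np.dot(pa, pb) / (np.dot(pa, pa) * np.dot(pb, pb)))
    return np.var(values)

rng = np.random.default_rng(0)
# Toy data (hypothetical): a small Gaussian cloud plus one distant point
X = np.vstack([rng.normal(size=(30, 2)), [[6.0, 6.0]]])

scores = np.array([abof(X, i) for i in range(len(X))])

# Low variance means "everything lies in the same direction": likely an outlier
print(f"Index with the lowest score: {scores.argmin()}")
```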

In [0]:
from pyod.models.abod import ABOD
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

In [0]:
Outlier_detection = ABOD(method= 'fast', contamination= 0.1, n_neighbors= 5)
clusters = Outlier_detection.fit_predict(df_Age_Fare)
print(f'Clusters: {set(clusters)}')

In [0]:
# plot the cluster assignments
plt.scatter(df_Age_Fare[:, 0], df_Age_Fare[:, 1], c=clusters, cmap="plasma")
plt.xlabel('Age')
plt.ylabel('Fare')

# Exercises
* Apply these techniques to the following dataframes:

## Exercise 1 - Predict Breast Cancer

In [0]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X= cancer['data']
y= cancer['target']

df_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))
df_cancer['target'] = df_cancer['target'].map({0: 'malignant', 1: 'benign'})
df_cancer.head()

## Exercise 2 - Boston Housing Price

In [0]:
# load_boston was removed in scikit-learn 1.2; as an alternative, a copy of the
# dataset can be fetched from OpenML (requires internet access)
from sklearn.datasets import fetch_openml

boston = fetch_openml(name='boston', version=1, as_frame=True)

df_boston = boston['data'].copy()
df_boston['target'] = boston['target']
df_boston.head()

## Exercise 3 - Iris
* [Here](https://en.wikipedia.org/wiki/Iris_flower_data_set) you will find more information about the iris dataframe. Check it out.

In [0]:
from sklearn.datasets import load_iris

iris = load_iris()
X= iris['data']
y= iris['target']

df_iris = pd.DataFrame(np.c_[X, y], columns= np.append(iris['feature_names'], ['target']))
df_iris['target'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
df_iris.head()

## Exercise 4 - Diabetes

In [0]:
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X= diabetes['data']
y= diabetes['target']

df_diabetes = pd.DataFrame(np.c_[X, y], columns= np.append(diabetes['feature_names'], ['target']))
df_diabetes.head()