<img src="https://i.imgur.com/sSS6VKv.png" width="750" height="300" align="center"/>

<br>
<h1 style = "font-size:30px; font-weight : bold; color : blue; text-align: center; border-radius: 10px 15px;"> Custom Outlier Removal within a Pipeline </h1>
<br>

---

## Overview

The goal of this short notebook is to show how to use a custom function for outlier removal within a pipeline. Including this step in the pipeline ensures that samples with outliers are only removed from the training data, leaving the test data as it is.

The resulting pipeline can be evaluated with K-fold cross-validation: in each iteration, the training folds have their outliers removed, while the test/validation fold is used in its entirety for prediction. This gives us a more realistic notion of the model's performance on 'unseen' data.
    
To test this approach, I chose two models, Logistic Regression and K-Nearest Neighbors, both of which are sensitive to outliers. I compared them with and without the outlier removal function, and these were their accuracies on repeated 10-fold CV (with vs. without our function):
- Logistic Regression: 77.12% vs 76.38%
- K-Nearest Neighbors: 73.60% vs 72.43%
 
The motivation for this notebook came from a question I had ([and solved after this answer](https://www.kaggle.com/discussion/230284#1261498)) about how to remove outliers inside K-fold CV to assess their impact on the model. I hope this notebook is helpful for other beginners like myself. If you have a different approach, feel free to share it. If you find it useful, please consider upvoting.

## Importing Libraries

Instead of the sklearn pipeline, we're going to use the imblearn pipeline. The advantage of imblearn is that it allows the use of different samplers to deal with imbalanced data. The imblearn package provides its own set of samplers, but we can also use a custom sampler with imblearn.FunctionSampler. We can take advantage of those features to create our function to remove outliers and call it within the pipeline as a sampler.

In [None]:
import pandas as pd       
import matplotlib as mat
import matplotlib.pyplot as plt    
import numpy as np
import seaborn as sns
%matplotlib inline

from sklearn import metrics
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from imblearn.pipeline import Pipeline
from imblearn import FunctionSampler


import warnings
warnings.filterwarnings('ignore')

## Checking the Data

In [None]:
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df

In [None]:
df.info()

In [None]:
df.isin([0]).sum()

As stated in this [paper](https://www.sciencedirect.com/science/article/pii/S2352914816300016) that makes use of the same dataset, the value of zero was recorded in place of missing experimental observations. So for all features, except number of pregnancies, we could assume that '0' represents a missing value.

Although it is not the intent of this notebook, it might be useful to point this out for those who will use this dataset.

In [None]:
#Replacing zeros with NaN (missing values) in five features
df.loc[:,'Glucose':'BMI'] = df.loc[:,'Glucose':'BMI'].replace(0,np.nan)

print('Values = 0\n')
print(df.isin([0]).sum())
print('\nValues = nan\n')
print(df.isnull().sum())

In [None]:
Y = df['Outcome']
X = df.copy().drop('Outcome', axis = 1)
features = X.columns.tolist()

## Visualizing the outliers

In this notebook, I consider as outliers the values that fall beyond the limits given by the [Interquartile Range (IQR) Method](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/). I recommend taking a look at other methods (univariate and multivariate) to detect outliers, including those already implemented in the sklearn package.
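
As a minimal sketch of how those limits are computed, the snippet below applies the IQR rule to a small made-up array (the values are illustrative and not taken from the dataset). It uses `np.nanpercentile` so that missing values are ignored, together with the usual factor of 1.5.

```python
import numpy as np

# Made-up values for illustration: one missing value and one extreme value
values = np.array([10., 12., 11., 13., 12., 11., 50., np.nan])

# Quartiles computed while ignoring NaN
q1 = np.nanpercentile(values, 25)
q3 = np.nanpercentile(values, 75)

# Limits: 1.5 * IQR below Q1 and above Q3
cut_off = (q3 - q1) * 1.5
lower, upper = q1 - cut_off, q3 + cut_off

# Comparisons with NaN are False on both sides, so NaN is never flagged
outliers = values[(values < lower) | (values > upper)]

print(lower, upper)  # 8.75 14.75
print(outliers)      # [50.]
```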

An easy way to display the distribution of each feature and check for outliers is to use boxplots.

In [None]:
plt.figure(figsize=(16,10))

for i,col in enumerate(features):    
    plt.subplot(2,4,i + 1)
    sns.boxplot(y=col, data=df)
    #plt.ylabel('')

plt.tight_layout()

plt.show()

Apparently, most features have a considerable share of outliers. We can take a better look at them by writing a function to highlight each outlier. We calculate the upper and lower limits for each feature (based on IQR Method) and, for every sample that falls outside those limits, we display their value and index number.

In [None]:
def IQR_Outliers (X, features):

    print('# of features: ', len(features))
    print('Features: ', features)

    indices = [x for x in X.index]
    #print(indices)
    print('Number of samples: ', len(indices))
    
    out_indexlist = []
        
    for col in features:
       
        #Using nanpercentile instead of percentile because of nan values
        Q1 = np.nanpercentile(X[col], 25.)
        Q3 = np.nanpercentile(X[col], 75.)
        
        cut_off = (Q3 - Q1) * 1.5
        upper, lower = Q3 + cut_off, Q1 - cut_off
        print ('\nFeature: ', col)
        print ('Upper and Lower limits: ', upper, lower)
                
        outliers_index = X[col][(X[col] < lower) | (X[col] > upper)].index.tolist()
        outliers = X[col][(X[col] < lower) | (X[col] > upper)].values
        print('Number of outliers: ', len(outliers))
        print('Outliers Index: ', outliers_index)
        print('Outliers: ', outliers)
        
        out_indexlist.extend(outliers_index)
        
    #using set to remove duplicates
    out_indexlist = list(set(out_indexlist))
    out_indexlist.sort()
    print('\nNumber of rows with outliers: ', len(out_indexlist))
    print('List of rows with outliers: ', out_indexlist)
    
    
IQR_Outliers(X, features)

We can see that 84 out of 768 samples were marked as outliers, which is a significant portion of our dataset. It is worth pointing out that dropping every outlier without an effort to understand it is not good practice. For instance, looking at the 'Age' feature, we see that any value higher than 66 is marked as an outlier. By dropping every one of those samples, we lose information about a significant group of the population that our dataset is meant to represent.

Given the intent of this notebook, we will make it simple and drop every sample outside the limits. We can adapt the previous function to receive all the (training) data and return a clean dataset.

In [None]:
def CustomSampler_IQR(X, y):
    
    features = X.columns
    df = X.copy()
    
    indices = df.index.tolist()
    out_indexlist = []
        
    for col in features:
       
        #Using nanpercentile instead of percentile because of nan values
        Q1 = np.nanpercentile(df[col], 25.)
        Q3 = np.nanpercentile(df[col], 75.)
        
        cut_off = (Q3 - Q1) * 1.5
        upper, lower = Q3 + cut_off, Q1 - cut_off
                
        #Index labels of the rows outside the limits (nan compares as False, so it is never flagged)
        outliers_index = df[col][(df[col] < lower) | (df[col] > upper)].index.tolist()
        out_indexlist.extend(outliers_index)
        
    #using set to remove duplicates
    out_indexlist = list(set(out_indexlist))
    
    #Index labels of the rows to keep
    clean_data = np.setdiff1d(indices,out_indexlist)

    return X.loc[clean_data], y.loc[clean_data]
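
To see what a sampler of this kind returns, here is a compact, vectorized sketch of the same idea applied to made-up data. The name `iqr_filter`, the parameter `k` and the toy values are assumptions for illustration, not part of the function above. `DataFrame.quantile` skips NaN by default, so it plays the role of `np.nanpercentile` here.

```python
import numpy as np
import pandas as pd

def iqr_filter(X, y, k=1.5):
    # Hypothetical helper: drop rows with any value outside the IQR limits
    q1 = X.quantile(0.25)
    q3 = X.quantile(0.75)
    cut_off = (q3 - q1) * k
    lower, upper = q1 - cut_off, q3 + cut_off
    # Comparisons with NaN are False, so missing values are never flagged
    is_outlier = (X.lt(lower, axis=1) | X.gt(upper, axis=1)).any(axis=1)
    return X.loc[~is_outlier], y.loc[~is_outlier]

# Made-up data: column 'a' has one extreme value, column 'b' has one NaN
X_toy = pd.DataFrame({'a': [1., 2., 3., 4., 100.],
                      'b': [10., np.nan, 12., 11., 13.]})
y_toy = pd.Series([0, 1, 0, 1, 1])

X_clean, y_clean = iqr_filter(X_toy, y_toy)
print(X_clean.index.tolist())  # [0, 1, 2, 3]
```

Only the row holding the extreme value is dropped. The row with the missing value is kept, so it can still be filled by an imputer later.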

## Building and Testing the Pipeline

After creating our function, we can build the pipeline. First, we use the 'FunctionSampler' to call our function as a sampler. After that, we call an imputer to fill the missing values, and then we define which classifier will be used. In this case, we'll start by building two pipelines, one for Logistic Regression and another for K-Nearest Neighbors.

In [None]:
LR_Pipeline = Pipeline([('Outlier_removal', FunctionSampler(func=CustomSampler_IQR, validate = False))
                        ,('Imputer', SimpleImputer(strategy = "median"))
                        ,('LR',  LogisticRegression(C = 0.7, random_state = 42, max_iter = 1000))])


KNN_Pipeline = Pipeline([('Outlier_removal', FunctionSampler(func=CustomSampler_IQR, validate = False))
                        ,('Imputer',SimpleImputer(strategy = "median"))
                        ,('KNN', KNeighborsClassifier(n_neighbors=7))])
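
Because the imputer comes after the sampler, its medians are learned from the training rows that survive the outlier removal and are then reused on the validation fold. Here is a minimal sketch of that fit/transform behaviour, using made-up arrays rather than the dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and validation columns, each with a missing value
X_train = np.array([[1.0], [3.0], [np.nan], [100.0]])
X_valid = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                      # median of 1, 3 and 100 is learned here

X_valid_filled = imputer.transform(X_valid)  # the training median fills the gap

print(imputer.statistics_)  # [3.]
print(X_valid_filled)
```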

Now we can assess the performance of our models using cross-validation. For each iteration, the training folds will be scanned for outliers and cleaned while the validation fold will be entirely used for prediction.

In [None]:
rp_st_kfold = RepeatedStratifiedKFold(n_splits=10, n_repeats = 3, random_state = 42)

cv_score = cross_val_score(LR_Pipeline, X, Y, cv=rp_st_kfold, scoring='accuracy')
print("Logistic Regression - Acc(SD): {0:0.4f} ({1:0.4f})". format(cv_score.mean(), cv_score.std()))

cv_score = cross_val_score(KNN_Pipeline, X, Y, cv=rp_st_kfold, scoring='accuracy')
print("K-Nearest Neighbors  - Acc(SD): {0:0.4f} ({1:0.4f})". format(cv_score.mean(), cv_score.std()))

For comparison purposes, we can test the same models without removing any outliers. As we can see, the accuracies of our first set of pipelines were a little higher. It's important to remember that our function marked more than 10% of this dataset as outliers. You can explore the data a little further and try different methods to remove or transform outliers to improve the accuracy even more.

In [None]:
LR_with_outliers = Pipeline([('Imputer', SimpleImputer(strategy = "median"))
                        ,('LR',  LogisticRegression(C = 0.7, random_state = 42, max_iter = 1000))])

KNN_with_outliers = Pipeline([('Imputer',SimpleImputer(strategy = "median"))
                        ,('KNN', KNeighborsClassifier(n_neighbors=7))])

cv_score = cross_val_score(LR_with_outliers, X, Y, cv=rp_st_kfold, scoring='accuracy')
print("Logistic Regression - Acc(SD): {0:0.4f} ({1:0.4f})". format(cv_score.mean(), cv_score.std()))

cv_score = cross_val_score(KNN_with_outliers, X, Y, cv=rp_st_kfold, scoring='accuracy')
print("K-Nearest Neighbors  - Acc(SD): {0:0.4f} ({1:0.4f})". format(cv_score.mean(), cv_score.std()))

That's it. I hope this notebook can be useful for some people. If you have any suggestions or questions, let me know.

## References:

https://towardsdatascience.com/enrich-your-train-fold-with-a-custom-sampler-inside-an-imblearn-pipeline-68f6dff964bf

https://imbalanced-learn.org/dev/references/generated/imblearn.FunctionSampler.html


https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/





## <center> If you find this notebook useful, support it with an upvote! </center>