### Adding an Outlier Detection Step to the Pipeline

In this notebook, we extend the pipeline from the previous notebook with an important data preprocessing step: **outlier detection**.

Recall that outliers are data points that differ significantly from the other observations and can distort the analysis or degrade model performance. For instance, a single very high or very low value can skew the mean, inflate the standard deviation, and affect model predictions.
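
To make this concrete, here is a minimal sketch using made-up readings (the numbers are hypothetical and not taken from the diabetes dataset). It compares the mean, standard deviation, and median with and without a single extreme value.

```python
import numpy as np

# Hypothetical readings; the last value is an obvious outlier
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
inliers = values[:-1]

# Statistics with the outlier included
mean_with, std_with = values.mean(), values.std()
median_with = np.median(values)

# Statistics with the outlier left out
mean_without, std_without = inliers.mean(), inliers.std()
median_without = np.median(inliers)

print(f"mean:   {mean_without:.2f} -> {mean_with:.2f}")
print(f"std:    {std_without:.2f} -> {std_with:.2f}")
print(f"median: {median_without:.2f} -> {median_with:.2f}")
```

In this toy case the mean more than doubles and the standard deviation grows sharply, while the median does not move. This is why quartile-based rules are a common choice for flagging outliers.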

To handle this, we implement a custom **OutlierRemover** transformer based on the **Interquartile Range (IQR)** method. The IQR, $Q3 - Q1$, measures the spread of the middle 50% of the data, where $Q1$ and $Q3$ are the first and third quartiles. A value is treated as an outlier if it falls below the lower bound $Q1 - 1.5 \times \text{IQR}$ or above the upper bound $Q3 + 1.5 \times \text{IQR}$. The bounds are computed separately for each feature when the transformer is fitted. Outliers are then replaced with **NaN**, and the resulting missing values are imputed with the feature mean.
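
The rule can be checked by hand on a small example. The sketch below uses a hypothetical one-dimensional array and plain NumPy. It illustrates the IQR rule and the NaN-then-mean idea, and is not the transformer defined later in the notebook.

```python
import numpy as np

# Hypothetical feature values; 50.0 is far from the rest
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

# First and third quartiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# IQR bounds with the usual factor of 1.5
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Replace values outside the bounds with NaN
cleaned = np.where((x < lower) | (x > upper), np.nan, x)

# Fill the NaN entries with the mean of the remaining values
imputed = np.where(np.isnan(cleaned), np.nanmean(cleaned), cleaned)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds: [{lower}, {upper}]")
print("imputed:", imputed)
```

Here $Q1 = 3$ and $Q3 = 7$, so the IQR is 4 and the bounds are $-3$ and $13$. Only the value 50 falls outside them, and it is replaced by 4.5, the mean of the remaining eight values.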

By introducing outlier detection into the pipeline, we ensure that extreme values are handled before later steps such as scaling and model fitting. This can make the scaler's statistics and the fitted model less sensitive to a few extreme observations.


In [21]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [22]:
class OutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self, factor=1.5):
        self.factor = factor
        self.lower_bound = []
        self.upper_bound = []

    def outlier_detector(self, X):
        # Calculate quartiles
        q1 = np.percentile(X, 25)
        q3 = np.percentile(X, 75)

        # Calculate IQR (Interquartile Range)
        iqr = q3 - q1

        # Calculate lower and upper bounds to identify outliers
        self.lower_bound.append(q1 - (self.factor * iqr))
        self.upper_bound.append(q3 + (self.factor * iqr))

    def fit(self, X, y=None):
        # Initialize lower and upper bounds
        self.lower_bound = []
        self.upper_bound = []

        # Compute the bounds for each column in turn
        X = np.asarray(X)
        for i in range(X.shape[1]):
            self.outlier_detector(X[:, i])

        return self

    def transform(self, X, y=None):
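        # Copy the input as a float array: this avoids modifying the caller's data
        # and guarantees NaN can be assigned (integer arrays cannot hold NaN)
        X = np.array(X, dtype=float)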

        # Iterate over all columns
        for i in range(X.shape[1]):
            x = X[:, i]

            # Masks to identify outliers
            lower_mask = x < self.lower_bound[i]
            upper_mask = x > self.upper_bound[i]

            # Set outliers to NaN; the imputer in the next pipeline step fills them with the column mean
            x[lower_mask | upper_mask] = np.nan

            # Assign the transformed column back to the original array
            X[:, i] = x

        return X
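
Note that the bounds are learned in `fit` and only applied in `transform`, so values in new data are judged against the quartiles of the data the transformer was fitted on. The sketch below illustrates this with two hypothetical helper functions (`iqr_bounds` and `mask_outliers` are stand-ins written for this example, not part of the notebook or of scikit-learn) and made-up toy arrays.

```python
import numpy as np

def iqr_bounds(X, factor=1.5):
    """Return per-column (lower, upper) IQR bounds. Hypothetical helper."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def mask_outliers(X, lower, upper):
    """Return a float copy of X with out-of-bounds entries set to NaN."""
    X = np.array(X, dtype=float)
    X[(X < lower) | (X > upper)] = np.nan
    return X

# Toy "training" data with two features
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                        [4.0, 40.0], [5.0, 50.0]])
# Toy "test" data; 40.0 in the first column lies far outside the training range
X_test_toy = np.array([[3.0, 25.0], [40.0, 35.0]])

# Bounds come from the training data only
lower, upper = iqr_bounds(X_train_toy)

# The same bounds are then applied to the test data
X_test_masked = mask_outliers(X_test_toy, lower, upper)

print("lower bounds:", lower)
print("upper bounds:", upper)
print(X_test_masked)
```

With these numbers the training bounds are $[-1, 7]$ for the first column and $[-10, 70]$ for the second, so only the value 40 in the first test column is set to NaN.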

In [23]:
# Load the diabetes dataset
diabetes = pd.read_csv('./datasets/diabetes.csv')

In [24]:
# Shuffle all samples so that any ordering in the file does not bias the split
diabetes = diabetes.sample(frac=1, random_state=42).reset_index(drop=True)

In [25]:
# Select features and target variable
selected_features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                     'BMI', 'DiabetesPedigreeFunction', 'Age']
X = diabetes[selected_features].values
y = diabetes['Outcome'].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [26]:
# Define the pipeline with the custom OutlierRemover
numeric_features = list(range(X.shape[1]))

numeric_transformer = Pipeline(steps=[
    ('outlier_remover', OutlierRemover()),
    ('imputer', SimpleImputer(strategy='mean')), # Impute NaN values with the mean
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ],
    remainder='passthrough'
)

# The preprocessor handles outlier removal, imputation, and standardization of the numeric features
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, C=1.0, penalty='l2'))
])

# Fit the model on the training data
pipeline.fit(X_train, y_train)
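
To see what the imputation and scaling steps do to an entry that was marked as NaN, here is a minimal sketch on a hypothetical one-column array. It uses only `SimpleImputer` and `StandardScaler`, chained in the same order as in `numeric_transformer`; the toy data and the name `toy_steps` are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy column in which one entry has already been marked as NaN
X_toy = np.array([[1.0], [2.0], [np.nan], [3.0]])

toy_steps = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # NaN -> column mean
    ('scaler', StandardScaler())                  # zero mean, unit variance
])

X_out = toy_steps.fit_transform(X_toy)

# Mean learned by the imputer from the non-missing values
imputed_mean = toy_steps.named_steps['imputer'].statistics_[0]

print("imputed mean:", imputed_mean)
print(X_out.ravel())
```

The missing entry is filled with 2.0, the mean of the observed values. Because the column mean is still 2.0 after imputation, that entry maps to 0 once standardized.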

In [27]:
# Make predictions on the test data
y_pred = pipeline.predict(X_test)

In [28]:
# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:\n', classification_report_str)
print('Confusion Matrix:\n', conf_matrix)

Accuracy: 0.79
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.88      0.84        96
           1       0.76      0.64      0.69        58

    accuracy                           0.79       154
   macro avg       0.78      0.76      0.76       154
weighted avg       0.78      0.79      0.78       154

Confusion Matrix:
 [[84 12]
 [21 37]]
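
The figures in the classification report can be recomputed from the confusion matrix printed above. The sketch below copies that matrix by hand and relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
conf_matrix = np.array([[84, 12],
                        [21, 37]])

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision_1 = tp / (tp + fp)   # of the predicted positives, how many are correct
recall_1 = tp / (tp + fn)      # of the actual positives, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:            {accuracy:.2f}")
print(f"precision (class 1): {precision_1:.2f}")
print(f"recall (class 1):    {recall_1:.2f}")
print(f"f1-score (class 1):  {f1_1:.2f}")
```

These values round to 0.79, 0.76, 0.64, and 0.69, matching the report. The 21 false negatives explain why recall for class 1 is noticeably lower than its precision.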
