### Data Preprocessing Pipeline

Edited by Okot Pascal:

Data preprocessing is a critical step in data science tasks, ensuring that raw data is transformed into a clean, organized, and structured format suitable for analysis. A data preprocessing pipeline streamlines this complex process by automating a series of steps, enabling data professionals to efficiently and consistently preprocess diverse datasets.

#### What is a Data Preprocessing Pipeline?

Data preprocessing involves transforming and manipulating raw data to improve its quality, consistency, and relevance for analysis. It encompasses several tasks, including handling missing values, standardizing variables, and removing outliers. By performing these preprocessing steps, data professionals ensure that subsequent analysis is based on reliable and accurate data, leading to better insights and predictions.

A data preprocessing pipeline is a systematic and automated approach that combines multiple preprocessing steps into a cohesive workflow. It serves as a roadmap for data professionals, guiding them through the transformations and calculations needed to cleanse and prepare data for analysis. The pipeline consists of interconnected steps, each of which is responsible for a specific preprocessing task, such as:

- Imputing missing values
- Scaling numeric features
- Finding and removing outliers
- Encoding categorical variables

By following a predefined sequence of operations, the pipeline keeps preprocessing consistent, reproducible, and efficient.
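As a rough illustration of chaining such steps, here is a minimal sketch that uses scikit-learn's `Pipeline` and `ColumnTransformer`. This is an alternative to the hand-written function used later in this notebook, not the approach it follows. The toy data and the column names `height_cm` and `group` are hypothetical, and outlier handling is left out because scikit-learn has no built-in transformer for the IQR rule.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "height_cm": [170.0, np.nan, 182.0, 165.0],
    "group": ["a", "b", np.nan, "a"],
})

# Numeric columns: fill missing values with the mean, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill missing values with the mode, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each sub-pipeline to its own columns; sparse_threshold=0 forces a dense array
preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, ["height_cm"]),
        ("cat", categorical_steps, ["group"]),
    ],
    sparse_threshold=0,
)

features = preprocessor.fit_transform(toy)
print(features.shape)
```

Each named step runs in a fixed order every time `fit_transform` is called, which is what makes the preprocessing repeatable across datasets.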

#### How Does a Data Preprocessing Pipeline Help Data Professionals?

A data preprocessing pipeline supports data professionals across several roles, including data engineers, data analysts, data scientists, and machine learning engineers.

+ For Data Engineers, the pipeline simplifies work by automating data transformation tasks, allowing them to focus on designing scalable data architectures and optimizing data pipelines.

+ Data Analysts benefit from the pipeline’s ability to normalize and clean data, ensuring accuracy and reducing time spent on data cleaning tasks. It allows analysts to spend more time on exploratory data analysis and gaining meaningful insights.

+ Data Scientists and Machine Learning Engineers rely on clean and well-preprocessed data for accurate predictive modelling and advanced analytics. The preprocessing pipeline automates repetitive preprocessing tasks, allowing them to experiment efficiently and iterate quickly on their datasets.

Here’s how to create a Data Preprocessing pipeline using Python based on the fundamental functions that every pipeline should perform while preprocessing any dataset:

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
pd.set_option("display.max_columns", None)

In [2]:
# Load the sample data from the 'Diabetes' sheet
data = pd.read_excel('Ug data.xlsx', sheet_name='Diabetes')
data

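| | Gender | BMI | HbA1c_level |
|---|---|---|---|
| 0 | NaN | 25.19 | NaN |
| 1 | Female | NaN | 6.6 |
| 2 | Male | 27.32 | 5.7 |
| 3 | Female | 23.45 | 5.0 |
| 4 | Male | 20.14 | 4.8 |
| 5 | Female | 27.32 | 6.6 |
| 6 | NaN | 19.31 | 6.5 |
| 7 | Female | 23.86 | 5.7 |
| 8 | Male | 33.64 | 4.8 |
| 9 | Female | NaN | 5.0 |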
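If the Excel file is not at hand, the sample rows shown above can be rebuilt in memory. The sketch below copies the displayed values by hand and assumes those ten rows are the whole sheet, which may not be the case, so results computed from it need not match the outputs shown later exactly.

```python
import numpy as np
import pandas as pd

# Values copied from the sample output above; np.nan marks the empty cells
data = pd.DataFrame({
    "Gender": [np.nan, "Female", "Male", "Female", "Male",
               "Female", np.nan, "Female", "Male", "Female"],
    "BMI": [25.19, np.nan, 27.32, 23.45, 20.14,
            27.32, 19.31, 23.86, 33.64, np.nan],
    "HbA1c_level": [np.nan, 6.6, 5.7, 5.0, 4.8,
                    6.6, 6.5, 5.7, 4.8, 5.0],
})

# Count the missing values in each column before any preprocessing
print(data.isna().sum())
```

Counting the missing values first makes it easy to confirm afterwards that the pipeline has filled every gap.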


In [3]:
def data_preprocessing_pipeline(data):
    # Note: this function modifies the DataFrame passed in and also returns it

    # Identify numeric and categorical features
    numeric_features = data.select_dtypes(include=['float', 'int']).columns
    categorical_features = data.select_dtypes(include=['object']).columns

    # Fill missing values in numeric features with each column's mean
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

    # Detect outliers in each numeric feature using the IQR rule
    # and replace them with the column mean
    for col in numeric_features:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        data[col] = np.where((data[col] < lower_bound) | (data[col] > upper_bound), data[col].mean(), data[col])

    # Standardize numeric features (zero mean, unit variance) once, after the loop
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])

    # Fill missing values in categorical features with each column's mode
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    return data

The pipeline above handles several common preprocessing tasks on a given dataset. Its functionality is explained below:

+ The pipeline begins by identifying the numeric and categorical features in the dataset.

+ It then fills any missing values in the numeric features with the mean of the respective feature (you can change this step to use a different imputation strategy). This ensures that missing data does not hinder subsequent analysis and computations.

+ The pipeline then identifies and handles outliers within the numeric features using the Interquartile Range (IQR) method. For each feature, it computes the first and third quartiles (Q1 and Q3) and the IQR (Q3 - Q1), and sets the lower and upper bounds at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. Any value outside these bounds is replaced with the mean of the respective feature. This step limits the influence of extreme values on subsequent analyses and model building.

+ After handling missing values and outliers, the pipeline standardizes the numeric features with `StandardScaler`, rescaling each one to zero mean and unit variance. This puts all numeric features on a comparable scale, avoiding biases caused by varying magnitudes.

+ The pipeline then handles missing values in the categorical features. It fills them with the mode, the most frequently occurring category. In the sample data above, the missing `Gender` values in rows 0 and 6 are filled with `Female`.
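To make the IQR step concrete, here is a small worked sketch on the `BMI` values from the sample table, assuming those ten rows are the complete column. The variable names are illustrative and not part of the pipeline function.

```python
import numpy as np
import pandas as pd

# BMI values copied from the sample table; np.nan marks the two missing entries
bmi = pd.Series([25.19, np.nan, 27.32, 23.45, 20.14,
                 27.32, 19.31, 23.86, 33.64, np.nan])

# Mean imputation, as in the pipeline's first step
bmi = bmi.fillna(bmi.mean())

# IQR bounds: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
q1 = bmi.quantile(0.25)
q3 = bmi.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside the bounds are the ones the pipeline would replace with the mean
outliers = bmi[(bmi < lower_bound) | (bmi > upper_bound)]
print(round(lower_bound, 2), round(upper_bound, 2))
print(outliers.tolist())
```

Under these assumptions the bounds come out at roughly 18.70 and 31.64, so 33.64 is the only value flagged as an outlier.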

In [4]:
# Test the pipeline
cleaned_data = data_preprocessing_pipeline(data)

print("Preprocessed Data")
cleaned_data

Preprocessed Data


| | Gender | BMI | HbA1c_level |
|---|---|---|---|
| 0 | Female | -0.337326 | -4.08295e-16 |
| 1 | Female | 0.630865 | 1.309276 |
| 2 | Male | 0.456427 | 0.1080471 |
| 3 | Female | -0.985743 | -0.8262424 |
| 4 | Male | -2.219228 | -1.093182 |
| 5 | Female | 0.456427 | 1.309276 |
| 6 | Female | -2.52853 | 1.175807 |
| 7 | Female | -0.832955 | 0.1080471 |
| 8 | Male | 0.630865 | -1.093182 |
| 9 | Female | 0.630865 | -0.8262424 |


In [5]:
# Round the data for easier manipulation and visualisation
df = cleaned_data.round(4)
df

| | Gender | BMI | HbA1c_level |
|---|---|---|---|
| 0 | Female | -0.3373 | -0.0 |
| 1 | Female | 0.6309 | 1.3093 |
| 2 | Male | 0.4564 | 0.108 |
| 3 | Female | -0.9857 | -0.8262 |
| 4 | Male | -2.2192 | -1.0932 |
| 5 | Female | 0.4564 | 1.3093 |
| 6 | Female | -2.5285 | 1.1758 |
| 7 | Female | -0.833 | 0.108 |
| 8 | Male | 0.6309 | -1.0932 |
| 9 | Female | 0.6309 | -0.8262 |


With the preprocessed data above, you can go on to build visualisations, train models, and much more.

https://github.com/OkotPascal