<h3>Data Preprocessing Pipeline using Python</h3>

Data preprocessing is a critical step in data science tasks, ensuring that raw data is transformed into a clean, organized, and structured format suitable for analysis. A data preprocessing pipeline streamlines this complex process by automating a series of steps, enabling data professionals to efficiently and consistently preprocess diverse datasets.

Data preprocessing involves transforming and manipulating raw data to improve its quality, consistency, and relevance for analysis. It encompasses several tasks, including handling missing values, standardizing variables, and removing or replacing outliers. Performing these steps means that subsequent analysis rests on reliable and accurate data, which leads to better insights and predictions.

A data preprocessing pipeline is a systematic and automated approach that combines multiple preprocessing steps into a cohesive workflow. It serves as a roadmap for data professionals, guiding them through the transformations and calculations needed to cleanse and prepare data for analysis. The pipeline consists of interconnected steps, each of which is responsible for a specific preprocessing task, such as:

 1. imputing missing values
 2. scaling numeric features
 3. detecting and handling outliers
 4. encoding categorical variables

Because every dataset passes through the same predefined sequence of operations, the pipeline makes preprocessing consistent, reproducible, and efficient.
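
The idea of a fixed sequence of steps can be sketched in a few lines. This is only an illustration, not the pipeline built later in this article: the function names `impute_numeric_mean`, `impute_categorical_mode`, and `run_pipeline`, and the toy data, are all hypothetical.

```python
import pandas as pd

def impute_numeric_mean(df):
    """Fill missing numeric values with each column's mean."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    return out

def impute_categorical_mode(df):
    """Fill missing non-numeric values with each column's most frequent value."""
    out = df.copy()
    cat_cols = out.select_dtypes(exclude="number").columns
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

def run_pipeline(df, steps):
    """Apply each step in order; every step takes and returns a DataFrame."""
    for step in steps:
        df = step(df)
    return df

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({"x": [1.0, None, 3.0], "label": ["a", None, "a"]})
result = run_pipeline(toy, [impute_numeric_mean, impute_categorical_mode])
print(result)
```

Keeping each step as a function that takes and returns a DataFrame makes the order explicit and lets steps be added, removed, or reordered without touching the others.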

A data preprocessing pipeline supports a range of data professionals, including data engineers, data analysts, data scientists, and machine learning engineers, each in a different way.

For Data Engineers, the pipeline simplifies work by automating data transformation tasks, allowing them to focus on designing scalable data architectures and optimizing data pipelines.

Data Analysts benefit from the pipeline’s ability to normalize and clean data, ensuring accuracy and reducing time spent on data cleaning tasks. It allows analysts to spend more time on exploratory data analysis and gaining meaningful insights.

Data Scientists and Machine Learning Engineers rely on clean, well-preprocessed data for accurate predictive modelling and advanced analytics. The preprocessing pipeline automates repetitive preprocessing tasks, allowing them to experiment efficiently and iterate quickly on their datasets.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [3]:
def data_preprocessing_pipeline(data):
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()

    # Identify numerical and categorical features
    numeric_features = data.select_dtypes(exclude=['object']).columns
    categorical_features = data.select_dtypes(include=['object']).columns

    # Handling missing values in numerical features
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

    # Detect outliers in numeric features with the 1.5 * IQR rule and replace them with the column mean
    for feature in numeric_features:
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        data[feature] = np.where((data[feature] < lower_bound) | (data[feature] > upper_bound),
                                data[feature].mean(), data[feature])

    # Standardize numeric features to zero mean and unit variance
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])

    # Handle missing values in categorical features
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    return data
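
The function above imputes categorical values but does not encode them, although encoding is one of the pipeline steps listed earlier. One possible way to add that step, shown here as a sketch rather than as part of the original function, is one-hot encoding with `pd.get_dummies`. The small `cleaned` frame below is hypothetical and only stands in for the pipeline's output.

```python
import pandas as pd

# Hypothetical frame standing in for the output of the preprocessing function
cleaned = pd.DataFrame({
    "NumericFeature1": [-1.5, 0.0, 1.5],
    "CategoricalFeature": ["A", "B", "A"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(cleaned, columns=["CategoricalFeature"], dtype=int)
print(encoded)
```

Each category becomes its own 0/1 column named `CategoricalFeature_<value>`, which is the form most scikit-learn estimators expect.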

In [4]:
data = pd.read_csv("archive/data.csv")

In [5]:
data.shape

(6, 3)

In [6]:
data.head(6)

   NumericFeature1  NumericFeature2 CategoricalFeature
0              1.0                7                  A
1              2.0                8                  B
2              NaN                9                NaN
3              4.0               10                  A
4              5.0               11                  B
5              6.0               50                  C
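
If `archive/data.csv` is not at hand, the same six rows can be rebuilt directly in code. This sketch assumes the displayed rows are the entire file, which matches the reported shape of `(6, 3)`.

```python
import numpy as np
import pandas as pd

# Rebuild the sample shown above; np.nan marks the two missing cells in row 2
data = pd.DataFrame({
    "NumericFeature1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "NumericFeature2": [7, 8, 9, 10, 11, 50],
    "CategoricalFeature": ["A", "B", np.nan, "A", "B", "C"],
})

print(data.shape)
print(data.isna().sum())  # count of missing values per column
```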


In [7]:
# Perform data preprocessing
cleaned_data = data_preprocessing_pipeline(data)
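
To see what the outlier step does during this call, the IQR bounds for `NumericFeature2` can be worked out on their own. The sketch below repeats the same arithmetic as the function on the six sample values. The variable names are illustrative.

```python
import pandas as pd

feature = pd.Series([7, 8, 9, 10, 11, 50], name="NumericFeature2")

q1 = feature.quantile(0.25)      # 8.25
q3 = feature.quantile(0.75)      # 10.75
iqr = q3 - q1                    # 2.5
lower_bound = q1 - 1.5 * iqr     # 4.5
upper_bound = q3 + 1.5 * iqr     # 14.5

# Values outside [lower_bound, upper_bound] are treated as outliers
is_outlier = (feature < lower_bound) | (feature > upper_bound)
print(feature[is_outlier].tolist())  # [50]
print(feature.mean())                # replacement value used by the pipeline
```

Only the value 50 falls outside the bounds, so the pipeline replaces it with the column mean (about 15.83). That mean is computed with the outlier still included, so the replacement is pulled upward by the very value it replaces.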

In [8]:
print("The pre-processed data:\n", cleaned_data.head(6))

The pre-processed data:
    NumericFeature1  NumericFeature2 CategoricalFeature
0        -1.535624        -1.099370                  A
1        -0.944999        -0.749128                  B
2         0.000000        -0.398886                  A
3         0.236250        -0.048645                  A
4         0.826874         0.301597                  B
5         1.417499         1.994431                  C
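
The scaled values can be checked by hand. `StandardScaler` computes `(x - mean) / std` using the population standard deviation (`ddof=0`), which is also NumPy's default. As a sketch, the snippet below applies that formula to `NumericFeature1` after mean imputation. The array is typed in directly from the sample data rather than taken from the pipeline.

```python
import numpy as np

# NumericFeature1 after imputation: the missing value became 3.6,
# the mean of the five observed values 1, 2, 4, 5, 6
values = np.array([1.0, 2.0, 3.6, 4.0, 5.0, 6.0])

# Standardize with the population standard deviation (np.std defaults to ddof=0)
z = (values - values.mean()) / values.std()
print(np.round(z, 6))
```

The first and last entries agree with the `-1.535624` and `1.417499` printed above. The imputed row lands at zero because its value is exactly the column mean.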
