### A data preprocessing pipeline should handle missing values, standardize numerical features, deal with outliers, and make the same preprocessing steps easy to replicate on new datasets. Here's how to create one in Python, built from the fundamental operations a pipeline should perform when preprocessing any dataset:

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def data_preprocessing_pipeline(data):
    # Identify numeric and categorical features
    numeric_features = data.select_dtypes(include=['number']).columns
    # 'object' matches pandas 2 text columns; 'string' matches the pandas 3 default
    categorical_features = data.select_dtypes(include=['object', 'string']).columns

    #Handle missing values in numeric features
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

    #Detect and handle outliers in numeric features using IQR
    for feature in numeric_features:
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        data[feature] = np.where((data[feature] < lower_bound) | (data[feature] > upper_bound),
                                 data[feature].mean(), data[feature])

    # Standardize numeric features (zero mean, unit variance)
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])

    #Handle missing values in categorical features
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    return data
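
To make the outlier step concrete, here is a minimal sketch of the same IQR rule applied to a single column. The toy values are hypothetical (they are not taken from `Data.csv`), and the replacement strategy mirrors the function above: any value outside the bounds is overwritten with the column mean, which is itself computed with the outlier still included.

```python
import pandas as pd

# Hypothetical toy column: five typical values and one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 100], dtype="float64")

# Same IQR bounds as in data_preprocessing_pipeline.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag values outside [lower_bound, upper_bound].
is_outlier = (values < lower_bound) | (values > upper_bound)

# Replace flagged values with the column mean (which still includes the outlier).
cleaned = values.where(~is_outlier, values.mean())

print("bounds:", lower_bound, upper_bound)
print("outliers flagged:", int(is_outlier.sum()))
print(cleaned.tolist())
```

Because the mean is pulled toward the outlier, the replacement value can still sit far from the typical range. Using the median instead is a common alternative, though that would be a change from the approach shown in this pipeline.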

In [5]:
data = pd.read_csv("Data.csv")

print("Original Data:")
print(data)

Original Data:
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
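
If `Data.csv` is not at hand, the same ten rows can be rebuilt directly from the printed table. The sketch below does that and counts the missing values per column, which shows what the pipeline has to fill in. The variable name `missing_per_column` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild the table printed above; NaN marks the two missing entries.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Count missing values in each column before any preprocessing.
missing_per_column = data.isna().sum()
print(missing_per_column)
```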


In [6]:
#Perform data preprocessing
cleaned_data = data_preprocessing_pipeline(data)

print("Preprocessed Data:")
print(cleaned_data)

Preprocessed Data:
   Country       Age        Salary Purchased
0   France  0.758874  7.494733e-01        No
1    Spain -1.711504 -1.438178e+00       Yes
2  Germany -1.275555 -8.912655e-01        No
3    Spain -0.113024 -2.532004e-01        No
4  Germany  0.177609  6.632192e-16       Yes
5   France -0.548973 -5.266569e-01       Yes
6    Spain  0.000000 -1.073570e+00        No
7   France  1.340140  1.387538e+00       Yes
8  Germany  1.630773  1.752147e+00        No
9   France -0.258340  2.937125e-01       Yes
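
One caveat about replication: the function above recomputes the means and the scaler every time it is called, so a new dataset would be scaled by its own statistics rather than by those of the original data. A possible way around this, sketched here under assumptions, is to fit scikit-learn's `SimpleImputer` and `StandardScaler` once inside a `Pipeline` and reuse the fitted object. The `train` and `new` frames are hypothetical, only numeric columns are covered, and the IQR outlier step is left out because scikit-learn has no built-in transformer for it.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and a later batch with the same numeric columns.
train = pd.DataFrame({
    "Age": [44, 27, 30, np.nan, 40],
    "Salary": [72000, 48000, np.nan, 61000, 54000],
})
new = pd.DataFrame({
    "Age": [35, np.nan],
    "Salary": [np.nan, 58000],
})

# Mean imputation followed by standardization, as in the function above.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Learn the column means and scaling parameters from the training data only.
numeric_pipeline.fit(train)

# Apply the already-fitted steps to the new batch without refitting.
new_scaled = numeric_pipeline.transform(new)
print(new_scaled)
```

Missing values in `new` are filled with the training means, so they land at approximately zero after scaling, just as the imputed `Age` and `Salary` entries do in the preprocessed output above.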


Note: pandas 3 stores text columns with a dedicated string dtype by default instead of `object`, so `select_dtypes(include=['object'])` may no longer pick them up. See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
