# Data Preprocessing Pipeline Using Python

### Cleaning and preparing raw data for data analysis and data science tasks

A data preprocessing pipeline is a systematic and automated approach that combines multiple preprocessing steps into a cohesive workflow. It serves as a roadmap for data professionals, guiding them through the transformations and calculations needed to cleanse and prepare data for analysis. The pipeline consists of interconnected steps, each of which is responsible for a specific preprocessing task, such as:

1. imputing missing values
2. scaling numeric features
3. finding and removing outliers
4. encoding categorical variables

By applying a predefined sequence of operations, the pipeline keeps preprocessing consistent, reproducible, and efficient.
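
As a brief illustration of the last item in the list, which the function later in this notebook does not perform, here is a minimal sketch of one-hot encoding with `pandas.get_dummies`. The column name and values are made up for the example.

```python
import pandas as pd

# Hypothetical categorical column, used only for illustration
df = pd.DataFrame({"CategoricalFeature": ["A", "B", "A", "C"]})

# One-hot encode: one 0/1 indicator column per distinct category
encoded = pd.get_dummies(df, columns=["CategoricalFeature"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row ends up with exactly one indicator set to 1, so the result can be passed to models that expect numeric input.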

## The pipeline automates the data transformation tasks

## The pipeline is built using Python

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

### Pipeline

In [2]:
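def data_preprocessing_pipeline(data):
    """Impute missing values, replace IQR outliers, and standardize numeric columns."""
    # Work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()

    # Identify numeric and categorical columns by dtype
    # ('number' matches every integer and float width)
    numeric_features = data.select_dtypes(include='number').columns
    categorical_features = data.select_dtypes(include=['object']).columns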
    
    # Handling missing values in numeric features
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())
    
    # Handling missing values in categorical features
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])
    
    
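    # Handling outliers in numeric features with the 1.5 * IQR rule
    for feature in numeric_features:
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)

        # Replace values outside the bounds with the column mean
        # (the mean is computed before replacement, so it includes the outliers)
        is_outlier = (data[feature] < lower_bound) | (data[feature] > upper_bound)
        data[feature] = np.where(is_outlier, data[feature].mean(), data[feature])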
    
    
    # Standardizing numeric features (zero mean, unit variance)
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])
    
    return data

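To see what the outlier step does, the sketch below applies the same 1.5 × IQR rule to the six values 7, 8, 9, 10, 11, 50 (the `NumericFeature2` column of the dataset printed further down). The variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

values = pd.Series([7, 8, 9, 10, 11, 50], dtype=float)

q1 = values.quantile(0.25)   # 8.25
q3 = values.quantile(0.75)   # 10.75
iqr = q3 - q1                # 2.5

lower_bound = q1 - 1.5 * iqr  # 4.5
upper_bound = q3 + 1.5 * iqr  # 14.5

is_outlier = (values < lower_bound) | (values > upper_bound)

# The mean is taken over all six values, the outlier included
cleaned = pd.Series(np.where(is_outlier, values.mean(), values))

print(is_outlier.tolist())
print(cleaned.round(3).tolist())
```

Only 50 falls outside the interval [4.5, 14.5]. It is replaced by the mean 95/6 ≈ 15.83, which still sits above the upper bound because the mean was computed with the outlier included.
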
In [3]:
data = pd.read_csv('Dataset.csv')

In [4]:
print("Original data: ")
print(data)

Original data: 
   NumericFeature1  NumericFeature2 CategoricalFeature
0              1.0                7                  A
1              2.0                8                  B
2              NaN                9                NaN
3              4.0               10                  A
4              5.0               11                  B
5              6.0               50                  C

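`Dataset.csv` is not reproduced here. Assuming it holds exactly the six rows printed above, an equivalent DataFrame can be built inline to follow along without the file. The name `sample` is chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Rebuild the printed rows by hand (assumes the CSV has no other rows or columns)
sample = pd.DataFrame({
    "NumericFeature1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "NumericFeature2": [7, 8, 9, 10, 11, 50],
    "CategoricalFeature": ["A", "B", np.nan, "A", "B", "C"],
})

# Count missing values per column before any cleaning
print(sample.isna().sum())
```

Row 2 is the only row with gaps: one missing numeric value and one missing category.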

In [5]:
# Data Preprocessing
transformed_data = data_preprocessing_pipeline(data)

In [6]:
print("Cleaned Data: ")
print(transformed_data)

Cleaned Data: 
   NumericFeature1  NumericFeature2 CategoricalFeature
0        -1.535624        -1.099370                  A
1        -0.944999        -0.749128                  B
2         0.000000        -0.398886                  A
3         0.236250        -0.048645                  A
4         0.826874         0.301597                  B
5         1.417499         1.994431                  C

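The scaled numbers can be checked by hand. `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`). The sketch below redoes that arithmetic with NumPy for `NumericFeature1` after mean imputation, where the missing value has become 3.6, the mean of 1, 2, 4, 5, and 6.

```python
import numpy as np

# NumericFeature1 after imputation: the NaN was replaced by the column mean 3.6
col = np.array([1.0, 2.0, 3.6, 4.0, 5.0, 6.0])

# z-score using the population standard deviation, as StandardScaler does
z = (col - col.mean()) / col.std(ddof=0)

print(np.round(z, 6))
```

The result agrees with the `NumericFeature1` column of the cleaned output. The imputed row lands at 0 because its value equals the column mean.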

## Conclusion

Data preprocessing transforms raw data to improve its quality, consistency, and relevance for analysis. The pipeline built here chains four steps into a single function: mean imputation for numeric columns, mode imputation for categorical columns, replacement of IQR outliers with the column mean, and standardization with `StandardScaler`.