## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using `Pipeline` from `sklearn.pipeline`.
    - Use `StandardScaler` to scale the features.

In [1]:

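import pandas as pd
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the iris measurements into a DataFrame (all four features are numeric)
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Single-step pipeline: standardize each column to zero mean and unit variance
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# fit_transform returns a NumPy array, so rebuild a DataFrame with the original column names
scaled_data = pipeline.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())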

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.019004          -1.340227         -1.315444
1          -1.143017         -0.131979          -1.340227         -1.315444
2          -1.385353          0.328414          -1.397064         -1.315444
3          -1.506521          0.098217          -1.283389         -1.315444
4          -1.021849          1.249201          -1.340227         -1.315444
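
The scaled values above can be checked by hand. `StandardScaler` standardizes each column as z = (x - mean) / std, where std is the population standard deviation (`ddof=0`). The following is a minimal sketch that assumes the same iris data as the cell above; the names `manual` and `scaled` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Standardize by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)

# The same transformation via StandardScaler
scaled = StandardScaler().fit_transform(df)

# The two results should agree up to floating-point error
print(np.allclose(manual.to_numpy(), scaled))
```

Note that pandas defaults to the sample standard deviation (`ddof=1`), so calling `df.std()` without the argument would give slightly different values from the scaler.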


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline that uses `SimpleImputer` to fill each missing value with the mean of its column.

In [2]:

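import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Small example dataset with one missing value per column
data = {
    'age': [25, 30, np.nan, 45, 50],
    'salary': [50000, 60000, 55000, np.nan, 70000]
}
df = pd.DataFrame(data)

# Single-step pipeline: replace each NaN with the mean of its column
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# fit_transform returns a NumPy array, so rebuild a DataFrame with the original column names
cleaned_data = pipeline.fit_transform(df)
cleaned_df = pd.DataFrame(cleaned_data, columns=df.columns)
print(cleaned_df)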

    age   salary
0  25.0  50000.0
1  30.0  60000.0
2  37.5  55000.0
3  45.0  58750.0
4  50.0  70000.0
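
Once fitted, the imputer keeps the column means it learned, so the same pipeline can fill gaps in rows it has not seen before. The sketch below assumes that new rows have the same columns as the training data; `new_rows` is made-up illustrative data and `learned_means` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 50],
    'salary': [50000, 60000, 55000, np.nan, 70000],
})

pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean'))])
pipeline.fit(df)

# Means learned during fit, one per column (NaN entries are ignored)
learned_means = pipeline.named_steps['imputer'].statistics_
print(dict(zip(df.columns, learned_means)))

# Hypothetical new rows: gaps are filled with the means learned from df,
# not with means recomputed from new_rows
new_rows = pd.DataFrame({'age': [np.nan, 33], 'salary': [62000, np.nan]})
filled = pd.DataFrame(pipeline.transform(new_rows), columns=new_rows.columns)
print(filled)
```

Calling `transform` rather than `fit_transform` on the new rows is the point of the design: the fill values stay fixed to what was learned from the original data.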
