## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that standardizes the numerical features of a dataset.
2. Steps:
    - Load a sample dataset with pandas.
    - Define a pipeline using `Pipeline` from `sklearn.pipeline`.
    - Use `StandardScaler` to rescale each feature to zero mean and unit variance.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: Load a sample dataset (created here manually)
data = pd.DataFrame({
    'Height': [165, 180, 175, 160, 170],
    'Weight': [65, 85, 75, 55, 68],
    'Age': [25, 32, 28, 40, 30]
})

print("Original Data:\n", data)

# Step 2: Define the pipeline with StandardScaler
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Step 3: Apply pipeline to scale data
scaled_data = pipeline.fit_transform(data)

# Convert scaled data back to DataFrame for easy viewing
scaled_df = pd.DataFrame(scaled_data, columns=data.columns)

print("\nScaled Data:\n", scaled_df)


Original Data:
    Height  Weight  Age
0     165      65   25
1     180      85   32
2     175      75   28
3     160      55   40
4     170      68   30

Scaled Data:
      Height    Weight       Age
0 -0.707107 -0.458535 -1.185854
1  1.414214  1.535096  0.197642
2  0.707107  0.538280 -0.592927
3 -1.414214 -1.455350  1.778781
4  0.000000 -0.159490 -0.197642
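The scaled values above can be checked by hand. The sketch below assumes the same sample data and applies z = (x - mean) / std with the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the variable names `manual_scaled`, `sklearn_scaled`, and `matches` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample data as in the scaling task
data = pd.DataFrame({
    'Height': [165, 180, 175, 160, 170],
    'Weight': [65, 85, 75, 55, 68],
    'Age': [25, 32, 28, 40, 30]
})

# Manual standardization: subtract the column mean and divide by the
# population standard deviation (ddof=0). pandas defaults to ddof=1,
# so the argument has to be passed explicitly.
manual_scaled = (data - data.mean()) / data.std(ddof=0)

# Reference result from scikit-learn
sklearn_scaled = StandardScaler().fit_transform(data)

# Compare the two results element by element
matches = np.allclose(manual_scaled.to_numpy(), sklearn_scaled)
print("Manual result matches StandardScaler:", matches)
```

Using `data.std()` without `ddof=0` would give the sample standard deviation and slightly smaller scaled values, which is a common source of confusion when comparing pandas and scikit-learn results.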


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by filling in missing values.
2. Steps:
    - Load a dataset that contains missing values.
    - Define a pipeline that uses `SimpleImputer` to replace each missing value with its column mean.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Step 1: Create a sample dataset with missing values
data = pd.DataFrame({
    'Age': [25, 30, None, 22, 40],
    'Salary': [50000, None, 60000, 52000, None]
})

print("Original Data with Missing Values:\n", data)

# Step 2: Define a pipeline to fill missing values with the mean
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Step 3: Fit the pipeline and transform the data
imputed_data = pipeline.fit_transform(data)

# Convert back to DataFrame for better readability
imputed_df = pd.DataFrame(imputed_data, columns=data.columns)

print("\nData after Imputation:\n", imputed_df)


Original Data with Missing Values:
     Age   Salary
0  25.0  50000.0
1  30.0      NaN
2   NaN  60000.0
3  22.0  52000.0
4  40.0      NaN

Data after Imputation:
      Age   Salary
0  25.00  50000.0
1  30.00  54000.0
2  29.25  60000.0
3  22.00  52000.0
4  40.00  54000.0
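The two tasks can also be chained, since a `Pipeline` runs its steps in order. The following is a minimal sketch, assuming the same sample data as the imputation task; the name `cleaning_pipeline` is hypothetical and does not appear in the original notebook. It fills missing values with the column mean, then standardizes the result, and reads the fill values back from the fitted imputer's `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same sample data as in the imputation task
data = pd.DataFrame({
    'Age': [25, 30, None, 22, 40],
    'Salary': [50000, None, 60000, 52000, None]
})

# Hypothetical combined pipeline: impute first, then scale.
# The order matters because the scaler should see the filled-in values.
cleaning_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cleaned = cleaning_pipeline.fit_transform(data)
cleaned_df = pd.DataFrame(cleaned, columns=data.columns)

# The fitted imputer stores the per-column fill values it learned
fill_values = cleaning_pipeline.named_steps['imputer'].statistics_
print("Fill values used by the imputer:", fill_values)
print("\nCleaned and scaled data:\n", cleaned_df)
```

The fill values should match the means seen in the imputed output above (29.25 for `Age` and 54000 for `Salary`). On real data, one would typically call `fit` on the training split only and `transform` on the rest, so that statistics from held-out rows do not leak into the cleaning step.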
