## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using Pipeline from sklearn.pipeline.
    - Use StandardScaler to scale features.

In [1]:
# Scale numerical features with a scikit-learn Pipeline
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Step 1: Load a sample dataset
data = {
    'Age': np.random.randint(20, 60, size=10),
    'Salary': np.random.randint(30000, 100000, size=10)
}
df = pd.DataFrame(data)
print("📦 Original Dataset:\n")
print(df)

# Step 2: Define a pipeline
scaling_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Step 3: Apply pipeline to scale features
scaled_data = scaling_pipeline.fit_transform(df)

# Convert scaled data back to a DataFrame for readability
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print("\n📊 Scaled Dataset using Pipeline:\n")
print(scaled_df)


📦 Original Dataset:

   Age  Salary
0   55   76530
1   40   68044
2   24   47274
3   28   65673
4   30   35004
5   52   39549
6   54   51291
7   26   71665
8   37   85212
9   22   59668

📊 Scaled Dataset using Pipeline:

        Age    Salary
0  1.490202  1.062553
1  0.262014  0.517367
2 -1.048054 -0.817008
3 -0.720537  0.365042
4 -0.556779 -1.605298
5  1.244564 -1.313303
6  1.408323 -0.558934
7 -0.884296  0.750000
8  0.016376  1.620331
9 -1.211813 -0.020751
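
To make the scaled values less opaque, the sketch below recomputes them by hand. It assumes the ten rows printed in the original dataset above (copied in as literals, since the notebook draws them at random without a seed) and checks that the pipeline output matches a manual z-score. The names `df_check`, `pipeline_scaled`, and `manual_scaled` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Values copied from the original dataset printed above, so the check is reproducible.
df_check = pd.DataFrame({
    "Age": [55, 40, 24, 28, 30, 52, 54, 26, 37, 22],
    "Salary": [76530, 68044, 47274, 65673, 35004, 39549, 51291, 71665, 85212, 59668],
})

# Same single-step pipeline as in the task.
pipeline_scaled = Pipeline([("scaler", StandardScaler())]).fit_transform(df_check)

# Manual z-score: subtract the column mean and divide by the population
# standard deviation (ddof=0), which is the convention StandardScaler uses.
manual_scaled = ((df_check - df_check.mean()) / df_check.std(ddof=0)).to_numpy()

print(np.allclose(pipeline_scaled, manual_scaled))
```

The comparison should print `True`. Note that pandas' `DataFrame.std()` defaults to `ddof=1`, so calling it without `ddof=0` would give slightly different numbers from the pipeline.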


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline to use SimpleImputer for filling missing values.

In [2]:
# Fill missing values with a scikit-learn Pipeline
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Step 1: Create a sample dataset with missing values
data = {
    'Age': [25, 30, np.nan, 45, 35],
    'Salary': [50000, np.nan, 60000, 80000, np.nan]
}
df = pd.DataFrame(data)
print("📦 Original Dataset with Missing Values:\n")
print(df)

# Step 2: Define a pipeline with imputation
imputation_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))  # Replace missing values with mean
])

# Step 3: Apply the pipeline to the data
imputed_data = imputation_pipeline.fit_transform(df)

# Convert imputed data back to DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print("\n🧼 Cleaned Dataset after Imputation:\n")
print(imputed_df)


📦 Original Dataset with Missing Values:

    Age   Salary
0  25.0  50000.0
1  30.0      NaN
2   NaN  60000.0
3  45.0  80000.0
4  35.0      NaN

🧼 Cleaned Dataset after Imputation:

     Age        Salary
0  25.00  50000.000000
1  30.00  63333.333333
2  33.75  60000.000000
3  45.00  80000.000000
4  35.00  63333.333333
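
The two tasks can also be chained, which is the main reason to use a pipeline in the first place. The following is a minimal sketch, not part of the original exercise: it reuses the dataset with missing values from this task, imputes first and scales second, then applies the fitted pipeline to a new row. The names `df_missing`, `cleaning_pipeline`, and `new_rows`, and the values in `new_rows`, are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same data as the imputation task above.
df_missing = pd.DataFrame({
    "Age": [25, 30, np.nan, 45, 35],
    "Salary": [50000, np.nan, 60000, 80000, np.nan],
})

# Steps run in order: fill missing values first, then standardize.
cleaning_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

cleaned = cleaning_pipeline.fit_transform(df_missing)
cleaned_df = pd.DataFrame(cleaned, columns=df_missing.columns)
print(cleaned_df)

# The fitted imputer keeps the column means it learned during fit.
print(cleaning_pipeline.named_steps["imputer"].statistics_)

# Hypothetical new row: transform() reuses the learned means and scales,
# it does not refit on the new data.
new_rows = pd.DataFrame({"Age": [np.nan], "Salary": [70000]})
new_cleaned = cleaning_pipeline.transform(new_rows)
print(new_cleaned)
```

Because the missing `Age` in `new_rows` is filled with the learned mean, its scaled value comes out as zero. Calling `transform` rather than `fit_transform` on new data is what keeps the cleaning consistent between training and later use.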
