## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using Pipeline from sklearn.pipeline.
    - Use StandardScaler to scale features.

In [6]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: Load a sample dataset
# The Iris dataset is used as an example; all four features are numeric
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Step 2: Define a pipeline with a single scaling step
# StandardScaler standardizes each column to zero mean and unit variance
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Step 3: Fit the pipeline and transform the data
scaled_data = pipeline.fit_transform(df)

# Step 4: Show the scaled data
# fit_transform returns a NumPy array, so wrap it in a DataFrame to restore the column names
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())



   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.019004          -1.340227         -1.315444
1          -1.143017         -0.131979          -1.340227         -1.315444
2          -1.385353          0.328414          -1.397064         -1.315444
3          -1.506521          0.098217          -1.283389         -1.315444
4          -1.021849          1.249201          -1.340227         -1.315444
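
As a sanity check on the output above, each scaled column should have a mean of about 0 and a standard deviation of about 1. The sketch below verifies this. It also shows one common way to use such a pipeline before modelling: fit the scaler on a training split only, then reuse it on held-out rows. The 80/20 split, `random_state=0`, and the name `split_pipeline` are illustrative assumptions, not part of the original task.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

pipeline = Pipeline(steps=[('scaler', StandardScaler())])
scaled = pipeline.fit_transform(df)

# StandardScaler divides by the population standard deviation (ddof=0),
# which is also NumPy's default for ndarray.std
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))

# Illustrative assumption: an 80/20 split with a fixed seed
X_train, X_test = train_test_split(df, test_size=0.2, random_state=0)

# Fit on the training rows only, then apply the same learned
# means and scales to the held-out rows
split_pipeline = Pipeline(steps=[('scaler', StandardScaler())])
X_train_scaled = split_pipeline.fit_transform(X_train)
X_test_scaled = split_pipeline.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)
```

Calling `transform` rather than `fit_transform` on the held-out rows keeps their statistics from influencing the scaler, so the test split will generally not have exactly zero mean or unit variance.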


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline to use SimpleImputer for filling missing values.

In [7]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Step 1: Load a dataset with missing values
# Creating a sample dataset with missing values for illustration
data = {'Feature1': [1, 2, 3, None, 5],
        'Feature2': [None, 2, None, 4, 5]}

df = pd.DataFrame(data)

# Step 2: Define a pipeline
# Here, we will apply SimpleImputer to fill missing values with the mean
pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])

# Step 3: Fit and transform the data using the pipeline
imputed_data = pipeline.fit_transform(df)

# Step 4: Show the imputed data
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print(imputed_df)


   Feature1  Feature2
0      1.00  3.666667
1      2.00  2.000000
2      3.00  3.666667
3      2.75  4.000000
4      5.00  5.000000
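
The two tasks can be combined, since a `Pipeline` runs its steps in order. The following is a minimal sketch, assuming the same toy data as above, that imputes missing values with the column mean and then scales the result. The name `cleaning_pipeline` is hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same toy data as in the imputation task, with missing values written as NaN
df = pd.DataFrame({
    'Feature1': [1, 2, 3, np.nan, 5],
    'Feature2': [np.nan, 2, np.nan, 4, 5],
})

# Steps run in order: fill missing values first, then scale.
# This order guarantees the scaler receives no NaN values.
cleaning_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

cleaned = pd.DataFrame(cleaning_pipeline.fit_transform(df), columns=df.columns)
print(cleaned)

# The fitted imputer stores the per-column fill values it learned
fill_values = cleaning_pipeline.named_steps['imputer'].statistics_
print(fill_values)
```

Individual fitted steps remain accessible through `named_steps`, which is useful for inspecting what the pipeline learned, such as the means used for filling.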
