## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using `Pipeline` from `sklearn.pipeline`.
    - Use `StandardScaler` to scale features.

In [6]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Step 2: Load a sample dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Step 3: Define a pipeline
scaling_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Step 4: Fit and transform the data using the pipeline
scaled_data = scaling_pipeline.fit_transform(df)

# Step 5: Convert result to DataFrame for readability
scaled_df = pd.DataFrame(scaled_data, columns=iris.feature_names)
print(scaled_df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.019004          -1.340227         -1.315444
1          -1.143017         -0.131979          -1.340227         -1.315444
2          -1.385353          0.328414          -1.397064         -1.315444
3          -1.506521          0.098217          -1.283389         -1.315444
4          -1.021849          1.249201          -1.340227         -1.315444
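
The scaled values above can be sanity-checked. The sketch below refits the same one-step pipeline and confirms that each column ends up with mean 0 and unit standard deviation. It also rebuilds the transform by hand from the fitted `mean_` and `scale_` attributes. The names `col_means`, `col_stds`, and `manual` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

scaling_pipeline = Pipeline([("scaler", StandardScaler())])
scaled_df = pd.DataFrame(
    scaling_pipeline.fit_transform(df), columns=df.columns, index=df.index
)

# StandardScaler divides by the population standard deviation (ddof=0),
# so compare with ddof=0 rather than pandas' default of ddof=1.
col_means = scaled_df.mean()
col_stds = scaled_df.std(ddof=0)

# Rebuild the transform by hand from the statistics learned during fit.
scaler = scaling_pipeline.named_steps["scaler"]
manual = (df.to_numpy() - scaler.mean_) / scaler.scale_

print(col_means.round(6))
print(col_stds.round(6))
print(np.allclose(manual, scaled_df.to_numpy()))
```

Note that `scaled_df.std()` with the pandas default (`ddof=1`) would report values slightly above 1, which is expected and not a sign of a scaling error.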


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline that uses `SimpleImputer` to fill missing values.

In [7]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import numpy as np

# Load Iris data into a DataFrame
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Introduce some missing values randomly for testing
np.random.seed(0)
missing_mask = np.random.rand(*df.shape) < 0.1  # 10% missing
df = df.mask(missing_mask)

print("Before Imputation:")
print(df.head())

# Define pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Impute only the numeric columns (any numeric dtype) and write the result back
numerical_columns = df.select_dtypes(include="number").columns
df[numerical_columns] = pipeline.fit_transform(df[numerical_columns])

print("\nAfter Imputation:")
print(df.head())

Before Imputation:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                NaN               NaN
4                NaN               3.6                1.4               0.2

After Imputation:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           5.100000               3.5           1.400000          0.200000
1           4.900000               3.0           1.400000          0.200000
2           4.700000               3.2           1.300000          0.200000
3           4.600000               3.1           3.779699          1.207087
4           5.791304               3.6           1.400000          0.200000
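
The two tasks can also be chained so that imputation and scaling run as one cleaning step. The following is a minimal sketch under stated assumptions, not part of the original exercise: the train/test split and the names `cleaning_pipeline`, `train_clean`, `test_clean`, and `learned_means` are introduced here for illustration. The pipeline is fitted on the training rows only, and the means it learned there are reused to fill and scale the held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Same masking approach as above: roughly 10% of the cells become NaN.
np.random.seed(0)
df = df.mask(np.random.rand(*df.shape) < 0.1)

# Hold out 20% of the rows (an assumption made for this sketch).
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

# Steps run in order: fill missing values first, then scale.
cleaning_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# fit_transform learns the statistics from the training rows only;
# transform reuses them on the held-out rows.
train_clean = pd.DataFrame(
    cleaning_pipeline.fit_transform(train_df),
    columns=df.columns, index=train_df.index,
)
test_clean = pd.DataFrame(
    cleaning_pipeline.transform(test_df),
    columns=df.columns, index=test_df.index,
)

# The fill values are the per-column means of the non-missing training cells.
imputer = cleaning_pipeline.named_steps["imputer"]
learned_means = pd.Series(imputer.statistics_, index=df.columns)
print(learned_means)
print(test_clean.isna().sum())
```

Fitting on the training rows alone keeps information from the held-out rows out of the fill values and scaling statistics, which matters once the cleaned data feeds a model.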
