## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using `Pipeline` from `sklearn.pipeline`.
    - Use `StandardScaler` to scale features.

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: Load the Iris dataset into a DataFrame (four numeric features)
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Step 2: Define a pipeline with StandardScaler
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Step 3: Fit the scaler and transform the features (returns a NumPy array)
scaled_features = pipeline.fit_transform(df)

# Wrap the array in a DataFrame again so the column names are kept
scaled_df = pd.DataFrame(scaled_features, columns=df.columns)

print(scaled_df.head())


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.019004          -1.340227         -1.315444
1          -1.143017         -0.131979          -1.340227         -1.315444
2          -1.385353          0.328414          -1.397064         -1.315444
3          -1.506521          0.098217          -1.283389         -1.315444
4          -1.021849          1.249201          -1.340227         -1.315444
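
To make the output above easier to interpret, the following sketch checks that `StandardScaler` is equivalent to a per-column z-score, `(x - mean) / std`, using the population standard deviation (`ddof=0`). The variable names `scaled` and `manual` are illustrative and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Rebuild the same Iris DataFrame used in the cell above
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(df)

# Manual z-score; ddof=0 gives the population standard deviation,
# whereas the pandas default (ddof=1) would give slightly different values
manual = (df - df.mean()) / df.std(ddof=0)

# Both approaches should agree up to floating-point tolerance
print(np.allclose(scaled, manual.to_numpy()))

# After scaling, each column has mean close to 0 and standard deviation close to 1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

The first row of `manual` should match row 0 of the printed table, for example about -0.900681 for sepal length.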


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline to use `SimpleImputer` for filling missing values.
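
Before running the steps on a full dataset, it may help to see what mean imputation does on a toy array. The sketch below uses a hypothetical 3x2 array `X` (not from the original notebook) and shows that each `NaN` is replaced by the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one missing value in each column
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# statistics_ holds the per-column fill values learned during fit:
# column 0 -> (1 + 3) / 2 = 2, column 1 -> (10 + 20) / 2 = 15
print(imputer.statistics_)
print(X_filled)
```

Because the fill values are learned in `fit`, the same fitted imputer can later be applied to new data with `transform` without recomputing the means.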

In [8]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_california_housing
import numpy as np

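# Step 1: Load California housing dataset
# Copy the frame so the injected NaNs do not alter the object returned by the loader
housing = fetch_california_housing(as_frame=True)
df = housing.frame.copy()

# Introduce missing values artificially for demonstration (~10% of all cells,
# including the target column MedHouseVal)
np.random.seed(0)
mask = np.random.rand(*df.shape) < 0.1
df[mask] = np.nan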

print("Missing values before imputation:\n", df.isnull().sum())

# Step 2: Define pipeline with SimpleImputer to fill missing values (mean)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Step 3: Fit and transform the dataset
df_imputed = pipeline.fit_transform(df)

# Convert to DataFrame with original columns
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)

print("\nMissing values after imputation:\n", df_imputed.isnull().sum())
print("\nFirst 5 rows of imputed data:\n", df_imputed.head())



Missing values before imputation:
 MedInc         2038
HouseAge       2108
AveRooms       2062
AveBedrms      2033
Population     2074
AveOccup       2014
Latitude       2077
Longitude      2179
MedHouseVal    2055
dtype: int64

Missing values after imputation:
 MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

First 5 rows of imputed data:
    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup   Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556  37.880000   
1  8.3014      21.0  6.238137   0.971880      2401.0  3.086691  35.629523   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260  37.850000   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945  37.850000   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467  37.850000   

    Longitude  MedHouseVal  
0 -122.230000        4.526  
1 -119.565726     