## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using Pipeline from sklearn.pipeline.
    - Use StandardScaler to scale features.

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: Load sample dataset (using Iris dataset here for demonstration)
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print("Original Data (first 5 rows):")
print(df.head())

# Step 2: Define a pipeline with StandardScaler
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Step 3: Fit the pipeline (learn each column's mean and standard deviation)
# and transform the features to zero mean and unit variance
scaled_data = pipeline.fit_transform(df)

# Convert back to DataFrame for readability
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

print("\nScaled Data (first 5 rows):")
print(scaled_df.head())


Original Data (first 5 rows):
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Scaled Data (first 5 rows):
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.019004          -1.340227         -1.315444
1          -1.143017         -0.131979          -1.340227         -1.315444
2          -1.385353          0.328414          -1.397064         -1.315444
3          -1.506521          0.098217          -1.283389         -1.315444
4          -1.021849          1.249201          -1.340227         -1.315444
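
As a quick sanity check on the scaled output above, the sketch below confirms that each column ends up with a mean close to 0 and a standard deviation close to 1, and that the result matches the formula z = (x - mean) / std computed by hand. It assumes scikit-learn's default settings, under which StandardScaler uses the population standard deviation (ddof=0) and the pipeline returns a NumPy array. The names `manual`, `means`, `stds`, and `matches_manual` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the same data and pipeline as in the cell above
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
pipeline = Pipeline([("scaler", StandardScaler())])
scaled = pipeline.fit_transform(df)

# Manual standardization: StandardScaler divides by the population std (ddof=0),
# whereas pandas' DataFrame.std defaults to the sample std (ddof=1)
manual = (df - df.mean()) / df.std(ddof=0)

means = scaled.mean(axis=0)   # per-column mean of the scaled data
stds = scaled.std(axis=0)     # per-column population std of the scaled data
matches_manual = np.allclose(scaled, manual.to_numpy())

print("Column means close to 0:", np.allclose(means, 0.0, atol=1e-9))
print("Column stds close to 1:", np.allclose(stds, 1.0))
print("Matches manual formula:", matches_manual)
```

If the manual version were computed with pandas' default `ddof=1`, the values would differ slightly from the pipeline output, which is a common source of confusion when comparing the two.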


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
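    - Create a small sample dataset with missing values.
    - Define a pipeline that uses SimpleImputer to fill missing values with the column mean.
    - Chain StandardScaler after the imputer so the cleaned data is also scaled.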

In [2]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Step 1: Create sample dataset with missing values for demonstration
# (None in a numeric column is stored by pandas as NaN)
data = {
    'Age': [25, 30, None, 35, 40],
    'Income': [50000, 60000, 55000, None, 65000]
}
df = pd.DataFrame(data)
print("Original Data with Missing Values:")
print(df)

# Step 2: Define a pipeline with SimpleImputer to fill missing values (e.g., mean) and then scale
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Fill missing values with column mean
    ('scaler', StandardScaler())                   # Then scale the data
])

# Step 3: Fit and transform the data using the pipeline
clean_scaled_data = pipeline.fit_transform(df)

# Convert back to DataFrame for easy viewing
clean_scaled_df = pd.DataFrame(clean_scaled_data, columns=df.columns)

print("\nData after Imputation and Scaling:")
print(clean_scaled_df)


Original Data with Missing Values:
    Age   Income
0  25.0  50000.0
1  30.0  60000.0
2   NaN  55000.0
3  35.0      NaN
4  40.0  65000.0

Data after Imputation and Scaling:
   Age  Income
0 -1.5    -1.5
1 -0.5     0.5
2  0.0    -0.5
3  0.5     0.0
4  1.5     1.5
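
One reason to wrap these steps in a pipeline is that the fitted object remembers what it learned and can clean new data the same way. The following is a minimal sketch of that idea: it fits the same imputer and scaler on the sample data, inspects the learned values through `named_steps`, and then calls `transform` on new rows. The `new_rows` DataFrame is hypothetical data invented for illustration; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same sample data and pipeline as in the cell above
df = pd.DataFrame({
    "Age": [25, 30, None, 35, 40],
    "Income": [50000, 60000, 55000, None, 65000],
})
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
pipeline.fit(df)

# Values learned during fit: the fill value per column, and the scaler's
# per-column standard deviation computed on the imputed data
fill_values = pipeline.named_steps["imputer"].statistics_
scale_values = pipeline.named_steps["scaler"].scale_
print("Imputer fill values:", fill_values)
print("Scaler std per column:", scale_values)

# Hypothetical new rows (illustrative only): transform reuses the fitted
# means and stds instead of recomputing them from the new data
new_rows = pd.DataFrame({"Age": [None, 45], "Income": [70000, None]})
new_scaled = pipeline.transform(new_rows)
print(pd.DataFrame(new_scaled, columns=new_rows.columns))
```

Calling `transform` rather than `fit_transform` on the new rows keeps the fill values and scaling consistent with the original data, so a missing Age in the new rows is filled with the mean learned during fitting (32.5) and therefore scales to 0.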
