## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using Pipeline from sklearn.pipeline.
    - Use StandardScaler to scale features.

In [1]:
# Write your code from here
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: Load a sample dataset
# Let's use the famous Iris dataset CSV from a URL for demonstration
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# The dataset has no header, so we assign column names
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(url, header=None, names=column_names)

# Extract numerical features (all columns except 'class')
X = df.drop(columns=['class'])

# Step 2: Define a pipeline with StandardScaler
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Step 3: Fit the pipeline and transform the data
X_scaled = pipeline.fit_transform(X)

print("Original data (first 5 rows):")
print(X.head())
print("\nScaled data (first 5 rows):")
print(X_scaled[:5])


Original data (first 5 rows):
   sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2

Scaled data (first 5 rows):
[[-0.90068117  1.03205722 -1.3412724  -1.31297673]
 [-1.14301691 -0.1249576  -1.3412724  -1.31297673]
 [-1.38535265  0.33784833 -1.39813811 -1.31297673]
 [-1.50652052  0.10644536 -1.2844067  -1.31297673]
 [-1.02184904  1.26346019 -1.3412724  -1.31297673]]
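
One way to sanity-check the scaled output above is to confirm that each column ends up with a mean of roughly 0 and a standard deviation of roughly 1. The sketch below is only illustrative: it uses a small hypothetical DataFrame (`X_demo`, with made-up values) instead of downloading the Iris file, and the names `scaling_pipeline` and `scaled_df` are assumptions, not part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data (made-up values), so the check runs offline
X_demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1],
    "petal_length": [1.4, 1.4, 6.0, 5.1, 5.9],
})

# Same single-step pipeline as in the task
scaling_pipeline = Pipeline([("scaler", StandardScaler())])
scaled = scaling_pipeline.fit_transform(X_demo)

# Wrap the NumPy result back into a DataFrame to keep the column names
scaled_df = pd.DataFrame(scaled, columns=X_demo.columns, index=X_demo.index)

# StandardScaler uses the population standard deviation, hence ddof=0
col_means = scaled_df.mean()
col_stds = scaled_df.std(ddof=0)

print("Means close to 0:", np.allclose(col_means, 0.0))
print("Std devs close to 1:", np.allclose(col_stds, 1.0))
```

Note that pandas defaults to the sample standard deviation (`ddof=1`), so calling `scaled_df.std()` without `ddof=0` would give values slightly above 1.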


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline to use SimpleImputer for filling missing values.

In [2]:
# Write your code from here
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Step 1: Load a dataset with missing values
# Let's create a sample DataFrame with missing values for demonstration
data = {
    'age': [25, np.nan, 22, 40, np.nan],
    'salary': [50000, 60000, np.nan, 80000, 70000]
}
df = pd.DataFrame(data)

# Features matrix
X = df

# Step 2: Define a pipeline with SimpleImputer
# We'll fill missing values with the mean of each column
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Step 3: Fit the pipeline and transform the data
# fit_transform returns a NumPy array by default, so the column names are dropped
X_imputed = pipeline.fit_transform(X)

print("Original data:")
print(X)
print("\nData after imputation:")
print(X_imputed)


Original data:
    age   salary
0  25.0  50000.0
1   NaN  60000.0
2  22.0      NaN
3  40.0  80000.0
4   NaN  70000.0

Data after imputation:
[[2.5e+01 5.0e+04]
 [2.9e+01 6.0e+04]
 [2.2e+01 6.5e+04]
 [4.0e+01 8.0e+04]
 [2.9e+01 7.0e+04]]
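
The two tasks can also be chained, so that missing values are filled before scaling in a single pipeline. The following is a minimal sketch under that assumption; it reuses the sample values from the imputation task, and the names `df_demo`, `cleaning_pipeline`, and `cleaned_df` are hypothetical. It also reads the learned column means from the fitted imputer's `statistics_` attribute to show where the fill values 29 and 65000 come from.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same sample values as in the imputation task
df_demo = pd.DataFrame({
    "age": [25, np.nan, 22, 40, np.nan],
    "salary": [50000, 60000, np.nan, 80000, 70000],
})

# Steps run in order: impute first, then scale the completed columns
cleaning_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
cleaned = cleaning_pipeline.fit_transform(df_demo)

# Restore the column names that the NumPy output drops
cleaned_df = pd.DataFrame(cleaned, columns=df_demo.columns, index=df_demo.index)

# Per-column means the imputer learned from the non-missing values
fill_values = cleaning_pipeline.named_steps["imputer"].statistics_

print("Fill values (age, salary):", fill_values)
print(cleaned_df)
```

Because the imputed entries equal the column mean, they land at 0 after scaling. Whether mean imputation is appropriate depends on the data; `SimpleImputer` also accepts strategies such as `'median'` and `'most_frequent'`.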
