# Iris ML Pipeline & Preprocessing Automation

**Week 1 Assignment**

This notebook builds a reproducible pipeline to **load → clean → feature engineer → split**, then exports a cleaned dataset. It includes:
- Missing value strategy (documented)
- Scaling/encoding functions
- Reusable `sklearn` Pipeline
- Reproducible seed and run command

Dataset link: https://www.kaggle.com/datasets/uciml/iris

Dataset paths on Kaggle:
- `Iris.csv`: `/kaggle/input/iris/Iris.csv`
- `database.sqlite`: `/kaggle/input/iris/database.sqlite`


## Step 1 — Missing Value Handling (Median Imputation)

Training data medians (used to fill missing numeric values):

- SepalLengthCm median = 5.8  
- SepalWidthCm median = 3.0  
- PetalLengthCm median = 4.3  
- PetalWidthCm median = 1.3  

If a numeric value is missing, it is replaced by that column's median as computed on the training split.
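
As a minimal sketch of the imputation rule above, the snippet below fits `SimpleImputer(strategy="median")` on a tiny made-up frame and fills a row of missing values. The toy values and the names `train`, `new_rows`, `filled`, and `learned_medians` are illustrative assumptions, not taken from `Iris.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy training frame; the values are illustrative, not from Iris.csv.
train = pd.DataFrame(
    {
        "SepalLengthCm": [5.0, 6.0, np.nan, 7.0],
        "SepalWidthCm": [3.0, np.nan, 2.0, 4.0],
    }
)

# The medians are learned from the training rows only (NaN entries are ignored).
imputer = SimpleImputer(strategy="median")
imputer.fit(train)

# A new row with gaps is filled with the learned training medians.
new_rows = pd.DataFrame({"SepalLengthCm": [np.nan], "SepalWidthCm": [np.nan]})
filled = imputer.transform(new_rows)

learned_medians = dict(zip(train.columns, imputer.statistics_))
print(learned_medians)
print(filled)
```

The learned medians live in `imputer.statistics_`, which is the attribute to check when documenting the values actually used in a run.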

## Step 2 — Scaling (StandardScaler)

Training data statistics:

**Means**
- SepalLengthCm mean = 5.8475  
- SepalWidthCm mean = 3.04  
- PetalLengthCm mean = 3.795  
- PetalWidthCm mean = 1.21333333  

**Scales (standard deviation; `StandardScaler` uses the population form, ddof = 0)**
- SepalLengthCm scale = 0.83466186  
- SepalWidthCm scale = 0.44185216  
- PetalLengthCm scale = 1.74718106  
- PetalWidthCm scale = 0.75508646  

**Formula**

Scaled value = (value − mean) / scale

**Example (a row whose values are all missing is first imputed with the medians, then scaled)**
- SepalLengthCm: (5.8 − 5.8475) / 0.83466186 = -0.0569  
- SepalWidthCm: (3.0 − 3.04) / 0.44185216 = -0.0905  
- PetalLengthCm: (4.3 − 3.795) / 1.74718106 = 0.2890  
- PetalWidthCm: (1.3 − 1.21333333) / 0.75508646 = 0.1148  
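
The arithmetic in the example can be reproduced with a few lines of plain Python. This is only a sketch: the dictionaries copy the medians, means, and scales listed in this section, and the helper `scale_value` is a hypothetical name introduced for illustration.

```python
# Training statistics copied from the lists in this section.
medians = {
    "SepalLengthCm": 5.8,
    "SepalWidthCm": 3.0,
    "PetalLengthCm": 4.3,
    "PetalWidthCm": 1.3,
}
means = {
    "SepalLengthCm": 5.8475,
    "SepalWidthCm": 3.04,
    "PetalLengthCm": 3.795,
    "PetalWidthCm": 1.21333333,
}
scales = {
    "SepalLengthCm": 0.83466186,
    "SepalWidthCm": 0.44185216,
    "PetalLengthCm": 1.74718106,
    "PetalWidthCm": 0.75508646,
}


def scale_value(value, mean, scale):
    """Standardize one value: (value - mean) / scale."""
    return (value - mean) / scale


# Scale the imputed (median) value of every column and round to 4 decimals.
scaled_medians = {
    col: round(scale_value(medians[col], means[col], scales[col]), 4)
    for col in medians
}
print(scaled_medians)
```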

In [4]:
# Step 1: Imports and settings
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

RANDOM_SEED = 42
TEST_SIZE = 0.2
DATA_PATH = "/kaggle/input/iris/Iris.csv"
OUTPUT_PATH = "cleaned_iris.csv"

MISSING_VALUE_STRATEGY = (
    "Numeric: median imputation. Categorical: most_frequent imputation."
)

print("Missing value strategy:", MISSING_VALUE_STRATEGY)
print("Seed:", RANDOM_SEED)


Missing value strategy: Numeric: median imputation. Categorical: most_frequent imputation.
Seed: 42


In [5]:
# Step 2: Load data
raw_df = pd.read_csv(DATA_PATH)
raw_df.head()


Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [6]:
# Step 3: Check missing values
missing_counts = raw_df.isna().sum()
missing_counts


Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [7]:
# Step 4: Basic cleanup (drop Id column) and split features/target
TARGET_COL = "Species"
ID_COL = "Id"

df = raw_df.drop(columns=[ID_COL]) if ID_COL in raw_df.columns else raw_df.copy()
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

X.head()


Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [8]:
# Step 5: Build preprocessing pipeline (impute + scale + encode)
num_features = X.select_dtypes(include=["number"]).columns.tolist()
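# Every non-numeric column is treated as categorical (the Iris features are all numeric, so this list is empty here).
cat_features = [col for col in X.columns if col not in num_features]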

num_pipeline = Pipeline(
    # Median imputer fills missing values; StandardScaler then standardizes
    # each column to mean 0 and standard deviation 1.
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

cat_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),  # fill missing values with the most frequent category
        ("encoder", OneHotEncoder(handle_unknown="ignore")),  # convert categories to one-hot indicator columns
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_pipeline, num_features),
        ("cat", cat_pipeline, cat_features),
    ],
    remainder="drop",
)

preprocessor


In [9]:
# Step 6: Train/test split (reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=TEST_SIZE,
    random_state=RANDOM_SEED,
    stratify=y,
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))


# Expected result: 120 rows go to the training set and 30 to the test set,
# so the 150 rows are split 80% train / 20% test.


Train size: 120
Test size: 30
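
To illustrate what `stratify=y` contributes to the split, here is a small self-contained sketch. It uses stand-in labels with 50 rows per species (an assumption mirroring the usual Iris class balance) and a placeholder feature column `row_id`; the names `X_demo`, `y_demo`, `train_counts`, and `test_counts` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in labels with 50 rows per species (assumed class balance).
y_demo = pd.Series(
    ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
)
# Placeholder feature column; only the labels matter for this check.
X_demo = pd.DataFrame({"row_id": range(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,
)

# With stratification, each species keeps the same share in both splits.
train_counts = y_tr.value_counts().to_dict()
test_counts = y_te.value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Without `stratify`, the per-species counts in a 30-row test set can drift away from an even split, which makes small test sets harder to compare across runs.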


In [10]:
# Step 7: Fit preprocessor on training data and transform full dataset
# Learn the preprocessing parameters from the training data only
# (medians for imputation, means and standard deviations for scaling).
preprocessor.fit(X_train)

# Apply the learned preprocessing to the full dataset (all rows).
X_processed = preprocessor.transform(X)

# Column names after transformation (e.g., num__SepalLengthCm).
feature_names = preprocessor.get_feature_names_out()

# Build a DataFrame from the processed values and column names.
cleaned_df = pd.DataFrame(X_processed, columns=feature_names)

# Add the target column (Species) back.
cleaned_df[TARGET_COL] = y.values

# Show the first 5 rows of the cleaned data.
cleaned_df.head()

Unnamed: 0,num__SepalLengthCm,num__SepalWidthCm,num__PetalLengthCm,num__PetalWidthCm,Species
0,-0.885662,1.027095,-1.347036,-1.320168,Iris-setosa
1,-1.124492,-0.099517,-1.347036,-1.320168,Iris-setosa
2,-1.363322,0.351127,-1.403853,-1.320168,Iris-setosa
3,-1.482738,0.125805,-1.290219,-1.320168,Iris-setosa
4,-1.005077,1.252417,-1.347036,-1.320168,Iris-setosa
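
The medians, means, and scales that a fitted preprocessor actually uses can be read back from its fitted attributes. The sketch below shows one way to do this on a toy frame; the data and the names `toy_train`, `toy_preprocessor`, and `fitted_num` are illustrative assumptions, while `named_transformers_`, `named_steps`, `statistics_`, `mean_`, and `scale_` are the scikit-learn attributes being demonstrated.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy training data; the values are illustrative, not from Iris.csv.
toy_train = pd.DataFrame(
    {"SepalLengthCm": [5.0, 6.0, 7.0], "PetalLengthCm": [1.0, 4.0, 7.0]}
)
num_cols = toy_train.columns.tolist()

toy_preprocessor = ColumnTransformer(
    transformers=[
        (
            "num",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler()),
                ]
            ),
            num_cols,
        )
    ],
    remainder="drop",
)
toy_preprocessor.fit(toy_train)

# Read the learned parameters back from the fitted numeric pipeline.
fitted_num = toy_preprocessor.named_transformers_["num"]
medians = fitted_num.named_steps["imputer"].statistics_
means = fitted_num.named_steps["scaler"].mean_
scales = fitted_num.named_steps["scaler"].scale_
print(dict(zip(num_cols, medians)))
print(dict(zip(num_cols, means)))
print(dict(zip(num_cols, scales)))

# A missing value is imputed with the median first, then scaled.
row = pd.DataFrame({"SepalLengthCm": [np.nan], "PetalLengthCm": [7.0]})
transformed = toy_preprocessor.transform(row)
print(transformed)
```

Applied to the notebook's own `preprocessor`, the same attribute lookups would give the statistics for the current split, which is a convenient way to keep documented values in step with a given run.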


In [11]:
# Step 8: Export cleaned dataset
cleaned_df.to_csv(OUTPUT_PATH, index=False)
print("Saved:", OUTPUT_PATH)


Saved: cleaned_iris.csv
